RFC 053: Filtering by contributor, genre, and subject #82

Open
wants to merge 4 commits into
base: main

alexwlchan
Contributor

@alexwlchan alexwlchan commented Jun 16, 2022

(Leaving 051 because I saw Jamie take it yesterday, 052 for the concepts pipeline)

Rendered version: https://github.com/wellcomecollection/docs/blob/052-filtering-by-subject/rfcs/052-filtering-by-subject/README.md

@alexwlchan alexwlchan changed the title RFC 052: Filtering by contributor, genre, and subject RFC 053: Filtering by contributor, genre, and subject Jun 16, 2022
Comment on lines +24 to +30
5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

6. The catalogues are the source of truth for subject identifiers.
We can find equivalent identifiers, but we can't pick them from scratch.

e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.
Contributor Author

These requirements might seem oddly specific; I added them as a way to negate some of the more wild ideas I had. (e.g. have "shadow identifiers" for unidentified concepts, but then it's impossible for anyone to work out how to filter!)

They did lead me towards the suggestion in "An idea".

Contributor

I agree that if something is unidentified, we shouldn't make up an id. It should remain unidentified and that may mean that we have to treat it differently, eg the difference between identified and unidentified series on works pages. But it is being true to the data. We are more likely to get into a pickle if we don't go with the grain of what we have.


Questions:

* Are the requirements as stated correct?
Contributor

A general response to the problem stated: it seems to me that the shortcomings of each approach (ID vs label) are kind of related to each other. That is to say - I think the existence of unidentifiable concepts in some sense implies an inability to disambiguate concepts by label.

In reality it's slightly more complex than that - as you say, with family names it's realistic to end up with multiple ambiguous and identified concepts. Still, I'm not sure there's really a problem per se...

  1. On per-concept pages, there's a sample of matching images/works.

Concept pages are for identified concepts, so this is fine with the ID search.

  2. On per-concept pages, there's a link to a filtered search for the given concept, directly below the sample results.

I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

  3. On work pages, the list of subjects/contributors/genres links to filtered searches for each subject.

I wonder if an approach here would be to link to a label search for an unidentified concept, and an ID search for an identified concept. Perhaps there could be UI signifiers that we were doing a text search rather than searching for some kind of canonical entity.
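
A minimal sketch of that linking rule, to make the idea concrete. The base URL and the `subjects.id` / `subjects.label` parameter names are hypothetical placeholders, not the confirmed interface of the works API:

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL and parameter names, for illustration only.
WORKS_SEARCH_URL = "https://api.example.org/works"


def subject_filter_url(label: str, canonical_id: Optional[str] = None) -> str:
    """Link to an ID search for identified subjects, a label search otherwise."""
    if canonical_id is not None:
        params = {"subjects.id": canonical_id}
    else:
        params = {"subjects.label": label}
    return f"{WORKS_SEARCH_URL}?{urlencode(params)}"
```

A caller only needs the subject as it appears on the work, which is roughly what requirement 5 asks for; the UI could use the same branch to decide whether to signal a text search.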

  4. In the works API, there are filters and aggregations for subject/contributor/genre.

Think this is implied by the above 3 requirements.

  5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

As with 3 but agreed it isn't necessarily obvious.

  6. The catalogues are the source of truth for subject identifiers.

For sure.

Appendix

I think I might have talked myself into (via responses to 3 and 5) some kind of concept filter that accepts IDs and labels? Is that an unacceptable deviation? It would behave much like the existing works search when searching for an ID.
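
As a rough illustration of what such a combined filter might do, here is an in-memory sketch. The `Concept` and `Work` shapes and the function names are assumptions made up for this example; the real filter would live in the search index, not in Python:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    label: str
    id: Optional[str] = None  # None means the concept is unidentified


@dataclass
class Work:
    title: str
    subjects: List[Concept]


def matches_concept(work: Work, value: str) -> bool:
    """True if any subject on the work has this ID or this exact label."""
    return any(s.id == value or s.label == value for s in work.subjects)


def filter_works(works: List[Work], value: str) -> List[Work]:
    """Keep the works that match the value as either an ID or a label."""
    return [w for w in works if matches_concept(w, value)]
```

Note that filtering on a label here returns identified and unidentified subjects alike, while filtering on an ID returns only the identified ones.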

Contributor Author

I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

Yeah, I can make that clearer. It's the "there are 5 images above, click here to see all the images" button.

I wanted to state 4 explicitly for the "API-as-product" requirement.

How do dual filters work in the front-end? Does the front-end combine aggregations for identified/unidentified subjects and present them as one list, and then different ticky boxes apply different filters?

The idea of a filter that does both is interesting, and makes me think of another question: do we need a way to find specifically unidentified subjects? e.g. if I'm on a work page with an unidentified subject "Mental health", is it okay to link to a search that includes the LCSH-identified subject "Mental health" as well as unidentified subjects?

Contributor

I suppose it would be better named a "concepts query" rather than a "concepts filter" in this case. I'm going off it already tbh - I'm not sure how the aggregations would work, although I think that would be the expected behaviour in this case.

Re search for unidentified concept labels returning identified concepts, I think that's fine.

Contributor

I agree with the general gist of what Jamie is saying above. Concept pages are for identified concepts; if it's not identified we don't give it a page. We tend towards more identified concepts over time (and it is the vast majority already, genres aside). For genres, that means no concept pages but a continuation of what we have today.

The caveat to the above is the potential mixing of user flows. If all genres do one thing and all subjects do another, that's probably ok. If it's less deterministic than that we may have a problem.

I would like @GarethOrmerod to be involved in this conversation, as I think we probably do need to treat linking to concept pages and back to searches for labels differently when we link to them from works pages, to make sure it's clear to users where they're going to go in each case.

@alexwlchan alexwlchan self-assigned this Jun 17, 2022
@jtweed jtweed self-requested a review June 17, 2022 13:13

* We can't rely on filtering by label, because there may be concepts with similar/identical labels but which refer to different things.
e.g. two members of the same family.

* We can't rely on filtering by ID, because not all subjects/genres/contributors are identified.
Contributor

We can filter by ID when we know it is an identified subject. I think I'm with Jamie and the direction he's heading below. Concept pages are for identified things. If they aren't identified, they can't have a page.

* We can't rely on filtering by ID, because not all subjects/genres/contributors are identified.

From an API perspective, we could easily support both, but it's more complicated in the front-end.
How do we choose which filter mechanism to use?
Contributor

By whether the subject is identified or not.


* Are the requirements as stated correct?
Are there any missing or unnecessary requirements?

* How many identified/unidentified concepts are there in the catalogue?
Contributor

@harrisonpim may also want to weigh in here.

Contributor Author

I went and had a look in a recent snapshot.

| | unique values | total occurrences |
|---|---|---|
| unidentified subjects; label only | 120,301 | 556,331 |
| identified subjects; canonical ID | 104,481 | 1,540,795 |
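
To put the snapshot figures above in proportion, a quick calculation (the numbers are the ones reported above; the helper name is just for this example):

```python
# Figures from the snapshot counts reported above.
unidentified = {"unique": 120_301, "occurrences": 556_331}
identified = {"unique": 104_481, "occurrences": 1_540_795}


def identified_share(key: str) -> float:
    """Fraction of subjects that are identified, for the given column."""
    return identified[key] / (identified[key] + unidentified[key])


print(f"unique values: {identified_share('unique'):.1%}")
print(f"occurrences:   {identified_share('occurrences'):.1%}")
```

On these numbers, identified subjects make up a little under half of the distinct subject values but nearly three quarters of the occurrences on works.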

@alexwlchan
Contributor Author

Reading your comments, I don't think I did a very good job of explaining this.

I agree this is one @GarethOrmerod should weigh in on – from an API perspective, supporting both ID and label for filters/aggregations isn't difficult. I think the tricky part is how those filters are presented in the front-end, because I'm not sure how we do it consistently.

This may be a non-issue if the vast majority of subjects are identified; I don't know if that's the case.

I've updated the RFC with examples of two UI flows where I see potential snarls; hopefully that makes it a bit clearer.


Q: How do we distinguish between a filter for the label "mental health" and the identified concept?

Q: How do we distinguish between two identified concepts with the same label?
Contributor

Regarding Concept pages, we've mentioned things like date of birth and a brief description being used as a disambiguation hint. Perhaps such a hint could be short enough to be parenthesised, or we could ensure that our Concept labels are unique by that mechanism: e.g. Nightingale (Bird) vs Nightingale (Statistician).

Wikipedia URLs are by name, rather than opaque id, and this is how they have solved the problem.

If we are linking via ID, then Concept labels can be mutable.

Contributor

I think concepts' preferred labels have to be mutable; the initial motivation for all of this work was a desire to update outdated language, and that's never going to stop being a problem.

Contributor

Yes. What I mean by that is that we have an advantage over Wikipedia in that no one elsewhere will be linking to a page like wellcomecollection.org/Florence_Nightingale and getting cross because we change it to wellcomecollection.org/Nightingale,_Florence.

Their solution is good enough for them even with their disadvantages (The Georgia Problem); it would (IMO) work even better for us.

They click the dropdown to see a list of available subject filters, and pick one.

Q: Is the list of available subjects based on ID or label?
Is it a mixture of both?
Contributor

We might want to do some analysis on the degree to which this is a problem. If there are only a handful of unidentified subjects, then we could endeavour to get them identified and work only on IDs.

If, OTOH, there are a substantial number of distinct unidentified subjects, then we may have to mix them.

Contributor

Another bit of analysis that would be useful:

How many concept labels do we actually have that fit the problematic criteria:

  • apply to more than one Concept
  • are not already disambiguated by IDs in the catalogue
  • are common enough in the catalogue that it could not be resolved by a limited and reasonable amount of cataloguing effort.
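
A sketch of how the criteria above might be checked against a snapshot, assuming each subject has been reduced to a `(label, canonical ID or None)` pair. That input shape, the function name, and the sample data are all made up for illustration. It can only flag labels that already have two or more identified concepts plus some unidentified uses; a label that is wholly unidentified but refers to several people would need manual review:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# (label, canonical ID or None); an ID of None marks an unidentified use.
ConceptRef = Tuple[str, Optional[str]]


def problematic_labels(refs: Iterable[ConceptRef]) -> Dict[str, Tuple[Set[str], int]]:
    """Labels shared by 2+ identified concepts that also have unidentified uses.

    Returns label -> (distinct IDs, number of unidentified occurrences).
    The occurrence count hints at how much cataloguing effort a fix needs.
    """
    ids_by_label: Dict[str, Set[str]] = defaultdict(set)
    unidentified_uses: Counter = Counter()
    for label, canonical_id in refs:
        if canonical_id is None:
            unidentified_uses[label] += 1
        else:
            ids_by_label[label].add(canonical_id)
    return {
        label: (ids, unidentified_uses[label])
        for label, ids in ids_by_label.items()
        if len(ids) > 1 and unidentified_uses[label] > 0
    }


# Made-up sample data, not taken from the catalogue.
sample = [
    ("Smith, John", "made-up-id-1"),
    ("Smith, John", "made-up-id-2"),
    ("Smith, John", None),
    ("Mental health", "made-up-id-3"),
    ("Mental health", None),
]
print(problematic_labels(sample))
```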

"John Nash" might be a good example to investigate. There are at least three John Nashes mentioned in the first page of results here: https://wellcomecollection.org/works?query=John+nash

Q: Is the list of available subjects based on ID or label?
Is it a mixture of both?

Q: How do we distinguish between a filter for the label "mental health" and the identified concept?
Contributor

I suspect (as long as everything is identified correctly) that what someone would be asking for when filtering by the label is more like a contains query than an equals query. i.e. they don't remember exactly what the label is, but it was something like X.

If we are in a situation where we have to mix label and ID in a UI, because there are too many unidentifiable labels, then that can only be seamlessly concealed from the user as long as there are no clashes (e.g. there are no label: "Mental Health" records that do not refer to the identified concept "Mental Health").
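
The contains-versus-equals distinction above can be sketched as two small predicates. Case-insensitive comparison is an assumption on my part, and the function names are hypothetical:

```python
def label_equals(label: str, query: str) -> bool:
    """Exact match, ignoring case: the behaviour of an 'equals' filter."""
    return label.casefold() == query.casefold()


def label_contains(label: str, query: str) -> bool:
    """Substring match, ignoring case: 'it was something like X'."""
    return query.casefold() in label.casefold()


labels = ["Mental health", "Mental health services", "Public health"]
print([l for l in labels if label_equals(l, "mental health")])
print([l for l in labels if label_contains(l, "mental health")])
```

The equals filter suits linking from a known subject on a work page; the contains query suits someone typing a half-remembered label.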

Contributor

@harrisonpim harrisonpim left a comment

In response to general discussion above, I'm going to go against the grain and say that I think we should be minting IDs for unidentified concepts, just like identified ones.

I think the proposed inconsistency in user experience between identified and unidentified concepts sounds Bad, and I can't really see a proper justification for it. We're introducing a lot of tricky technical work for ourselves through that distinction, and I can't see who it's helping. I feel like the same amount of work could be better spent on the careful minting/merging/redirecting etc of concept IDs, while providing a more consistent experience for users.

Even if the majority of subjects are identified, we still have a lot of unidentified things (subjects, people, genres) and I don't think users should have to worry about whether we've managed to enrich the thing that they've clicked on - they'll expect the experience to be the same. Again, coming up with ways of hinting to users that some concepts are identified and some aren't through design just sounds like extra work for us and our users, and I can't see who it helps.

It's also been said elsewhere that we shouldn't introduce new data to what's in the catalogue, and I agree - I just don't think we're introducing anything new by IDing a string which already exists.
If anything, I think consistency provides a more honest reflection of what's in the catalogue.
In that sense, a bare concepts page also feels like a better nudge for folks in collections info to identify those weird, dangling concepts, compared to a page of search results. It seems easier to me to imagine enriching/disambiguating an existing, sparse concept page than the leap from filtered search results to full concepts page.

@paul-butcher
Contributor

In that sense, a bare concepts page also feels like a better nudge for folks in collections info to identify those weird, dangling concepts, compared to a page of search results.

Also, finding something attached to the "wrong" Concept would be a better nudge to get it attached to the "right" one.

I feel like the same amount of work could be better spent on the careful minting/merging/redirecting etc of concept IDs, while providing a more consistent experience for users.

My main concern with the "mint everything" approach is that we might end up fragmenting Concepts due to subtle differences in labels (perhaps reflecting a change in practice over time). If that is just down to a typo in a few records (e.g. Falliaze vs Fallaize), then that's a clear error that's simple enough to sort out. OTOH, if half the records are "Florence Nightingale" and the other half are "Nightingale, Florence", and it's left unresolved for too long, users might start collecting bookmarks to the "wrong" one, and we'll need to do some redirecting when it is resolved. As you say, focusing effort on merging/redirecting from the start might be a better approach, because it's probably inevitable.
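
For what the redirecting could look like, a minimal sketch assuming merges are recorded as a simple old-ID to new-ID mapping. The mapping, the IDs, and the function name are hypothetical, not an existing part of the pipeline:

```python
from typing import Dict

# Hypothetical minted IDs; each key has been merged into its value.
REDIRECTS: Dict[str, str] = {
    "minted-nightingale-florence": "minted-florence-nightingale",
    "minted-florence-nightingale": "canonical-nightingale",
}


def resolve(concept_id: str, redirects: Dict[str, str]) -> str:
    """Follow merge redirects until reaching an ID that has not been merged."""
    seen = set()
    while concept_id in redirects:
        if concept_id in seen:
            raise ValueError(f"redirect cycle at {concept_id!r}")
        seen.add(concept_id)
        concept_id = redirects[concept_id]
    return concept_id


print(resolve("minted-nightingale-florence", REDIRECTS))
```

Old bookmarks keep working as long as every merged ID stays in the mapping, which is why it seems worth building this in from the start.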

@jtweed jtweed mentioned this pull request Jun 23, 2022