RFC 053: Filtering by contributor, genre, and subject #82

Closed
wants to merge 4 commits into from
6 changes: 3 additions & 3 deletions rfcs/037-api-faceting-principles/README.md
@@ -1,4 +1,4 @@
# API faceting principles & expectations
# RFC 037: API faceting principles & expectations

**Status:** Draft

@@ -63,7 +63,7 @@ and an aggregation on the labels would be:
http://host.name/path/docs?aggregations=a.b.label
```

**3. Aggregations are returned in an `aggregations` field, with the same name by which they were requested**

This means JSON paths are still represented as strings, rather than being expanded. For example, the response to the previous example would include at the top level

@@ -139,7 +139,7 @@ But if a separate (non-paired) filter was applied that happened to exclude the `

**6. When a filter and its paired aggregation are both applied, the bucket corresponding to the filtered value is always present**

Explicitly: even if other filters or queries are present which cause a bucket which currently has an applied filter to be empty (ie, it has a count of 0), it still appears in the aggregation. This is necessary so that the interface for the filter can still be rendered.
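
A minimal sketch of principle 6, assuming each bucket is a plain dict with hypothetical `value` and `count` keys (the real response shape may differ). Any filtered value that is missing from the computed buckets is appended with a count of 0, so the filter can still be rendered:

```python
from typing import Dict, List


def ensure_filtered_buckets(
    buckets: List[Dict], filtered_values: List[str]
) -> List[Dict]:
    """Return the buckets, adding a zero-count bucket for every
    filtered value that the aggregation did not return."""
    # "value" and "count" are assumed key names, used for illustration only.
    present = {bucket["value"] for bucket in buckets}
    result = list(buckets)
    for value in filtered_values:
        if value not in present:
            result.append({"value": value, "count": 0})
            present.add(value)
    return result
```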

**7. Aggregations on fields contained in sum types return buckets of the type's components**

85 changes: 85 additions & 0 deletions rfcs/053-filtering-by-subject/README.md
@@ -0,0 +1,85 @@
# RFC 053: Filtering by contributor, genre, and subject

For per-subject and per-person pages, we want to filter for images and works that match a given subject/person.
For example:

* Works by Florence Nightingale
* Works about Charles Darwin
* Images about mental health

This turns out to be non-trivial, so this RFC describes how we'll get there.

Strictly speaking we only need subjects and contributors for this work, but genres are so similar we should treat them in the same way.

## Requirements

1. On per-concept pages, there's a sample of matching images/works.

2. On per-concept pages, there's a link to a filtered search for the identified concept, directly below the sample results.

3. On work pages, the listed subjects/contributors/genres link to:

- a concept page if the subject/contributor is identified (new behaviour)
- a filtered search by label if the subject/contributor/genre is unidentified (existing behaviour)

4. In the works API, there are filters and aggregations for subject/contributor/genre that support (3).

5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

6. The catalogues are the source of truth for subject identifiers.
We can find equivalent identifiers, but we can't pick them from scratch.

e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.
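
The linking rule in requirement 3 could be sketched as below. This is an illustration only: the `/concepts/{id}` path, the `/works` path, and the `subjects.label` query parameter are assumptions rather than confirmed names.

```python
from typing import Optional
from urllib.parse import urlencode


def link_for_subject(label: str, concept_id: Optional[str] = None) -> str:
    """Choose where a subject shown on a work page should link to."""
    if concept_id is not None:
        # Identified: link to the concept page (new behaviour).
        # The "/concepts/<id>" path is a hypothetical placeholder.
        return f"/concepts/{concept_id}"
    # Unidentified: fall back to a filtered search by label (existing behaviour).
    # "subjects.label" is an assumed parameter name.
    return "/works?" + urlencode({"subjects.label": label})
```

Note that the function never invents an identifier: a subject with no ID in the catalogue always takes the label route, in line with requirement 6.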
Comment on lines +27 to +33
Contributor Author

These requirements might seem oddly specific; I added them as a way to negate some of the more wild ideas I had. (e.g. have "shadow identifiers" for unidentified concepts, but then it's impossible for anyone to work out how to filter!)

They did lead me towards the suggestion in "An idea".

Contributor

I agree that if something is unidentified, we shouldn't make up an id. It should remain unidentified and that may mean that we have to treat it differently, eg the difference between identified and unidentified series on works pages. But it is being true to the data. We are more likely to get into a pickle if we don't go with the grain of what we have.


## Current behaviour

We have filters and aggregations for *label*, not ID.

## Considerations for future behaviour

If we add filtering/aggregations for subjects by ID, we already know how they'll be named: `subjects` and `source.subjects` for works and images, respectively.
This is consistent with our existing API design.
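
As a rough sketch of how such ID filters might be requested: only the parameter names `subjects` and `source.subjects` come from the text above. The host, the `/works` and `/images` paths, and the concept ID are placeholders.

```python
from urllib.parse import urlencode

# Placeholder host, not the real API root.
API_ROOT = "https://api.example.org/catalogue"

# Parameter names as given in the RFC; the endpoint names are assumed.
ID_FILTER_PARAMS = {"works": "subjects", "images": "source.subjects"}


def subject_filter_url(resource: str, concept_id: str) -> str:
    """Build a search URL filtered by a subject's canonical ID."""
    param = ID_FILTER_PARAMS[resource]
    return f"{API_ROOT}/{resource}?" + urlencode({param: concept_id})
```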

But do we add filtering/aggregations for subjects by ID?
How do we handle this in the front-end?

Consider the following flows in the front-end:

1. A user lands on the concept page for "mental health".

This includes a list of works with that identified concept.
When they click to see the full list of works, they should see filtered search results.

This filtered search must use ID filtering, because there may be concepts with similar/identical labels but which refer to different things.
e.g. two members of the same family.

Q: How do we distinguish this in the UI from a label search for "mental health"?

Q: Should a user be able to discover this filter through the search UI?
If they remove the filter, can they re-add it without going via the concepts page?

2. A user is on a search page.
They want to filter by subject.
They click the dropdown to see a list of available subject filters, and pick one.

Q: Is the list of available subjects based on ID or label?
Is it a mixture of both?
Contributor

We might want to do some analysis on the degree to which this is a problem. If there are only a handful of unidentified subjects, then we could endeavour to get them identified and work only on IDs.

If, OTOH, there are a substantial number of distinct unidentified subjects, then we may have to mix them.

Contributor

Another bit of analysis that would be useful:

How many concept labels do we actually have that fit the problematic criteria:

  • apply to more than one Concept
  • are not already disambiguated by IDs in the catalogue
  • are common enough in the catalogue that it could not be resolved by a limited and reasonable amount of cataloguing effort.
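
A possible starting point for that analysis, assuming the snapshot can be reduced to `(label, id)` pairs with `None` for unidentified subjects (the input shape and the threshold are assumptions). It can only surface candidates; whether the unidentified uses of a label really refer to different Concepts still needs human review.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def label_stats(
    subjects: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Tuple[Set[str], int]]:
    """Map each label to (distinct IDs seen, occurrences with no ID)."""
    ids: Dict[str, Set[str]] = defaultdict(set)
    unidentified: Dict[str, int] = defaultdict(int)
    for label, concept_id in subjects:
        if concept_id is None:
            unidentified[label] += 1
        else:
            ids[label].add(concept_id)
    labels = set(ids) | set(unidentified)
    return {
        label: (ids.get(label, set()), unidentified.get(label, 0))
        for label in labels
    }


def candidates(
    stats: Dict[str, Tuple[Set[str], int]], min_unidentified: int
) -> List[str]:
    """Labels used without an ID at least `min_unidentified` times."""
    return sorted(
        label
        for label, (_, count) in stats.items()
        if count >= min_unidentified
    )
```

Labels whose set of IDs has more than one member are already disambiguated in the catalogue; the `candidates` list is the part that would need cataloguing effort.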

"John Nash" might be a good example to investigate. There are at least three John Nashes mentioned in the first page of results here: https://wellcomecollection.org/works?query=John+nash


Q: How do we distinguish between a filter for the label "mental health" and the identified concept?
Contributor

I suspect (as long as everything is identified correctly) that what someone would be asking for when filtering by the label is more like a contains query than an equals query. i.e. they don't remember exactly what the label is, but it was something like X.

If we are in a situation where we have to mix label and id in a UI, because there are too many unidentifiable labels, then that can only be seamlessly concealed from the user as long as there are no clashes (e.g. there are no label: "Mental Health" records that do not refer to the identified concept "Mental Health")


Q: How do we distinguish between two identified concepts with the same label?
Contributor

Regarding Concept pages, we've mentioned things like date of birth and a brief description being used as a disambiguation hint. Perhaps such a hint could be short enough to be parenthesised, or we could also ensure that our Concept labels are unique by that mechanism: e.g. Nightingale (Bird) vs Nightingale (Statistician).

Wikipedia URLs are by name, rather than opaque id, and this is how they have solved the problem.
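
A small sketch of the parenthesised-hint idea, assuming each concept is a dict with a `label` and an optional `hint` (both key names are hypothetical). The hint is only shown when the label alone would be ambiguous:

```python
from collections import Counter
from typing import Dict, List, Optional


def display_labels(concepts: List[Dict[str, Optional[str]]]) -> List[str]:
    """Append a parenthesised hint to labels shared by several concepts."""
    counts = Counter(concept["label"] for concept in concepts)
    labels = []
    for concept in concepts:
        label, hint = concept["label"], concept.get("hint")
        if counts[label] > 1 and hint:
            # Ambiguous label: add the disambiguation hint.
            labels.append(f"{label} ({hint})")
        else:
            # Unique label, or no hint available: leave the label as it is.
            labels.append(label)
    return labels
```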

If we are linking via ID, then Concept labels can be mutable.

Contributor

I think concepts' preferred labels have to be mutable; the initial motivation for all of this work was a desire to update outdated language, and that's never going to stop being a problem

Contributor

Yes. What I mean by that is that we have an advantage over Wikipedia in that no one elsewhere will be linking to a page like wellcomecollection.org/Florence_Nightingale and getting cross because we change it to wellcomecollection.org/Nightingale,_Florence.

Their solution is good enough for them even with their disadvantages (The Georgia Problem); it would (IMO) work even better for us.


Questions:

* Are the requirements as stated correct?
Contributor

A general response to the problem stated: it seems to me that the shortcomings of each approach (ID vs label) are kind of related to each other. That is to say - I think the existence of unidentifiable concepts in some sense implies an inability to disambiguate concepts by label.

In reality it's slightly more complex than that - as you say, with family names it's realistic to end up with multiple ambiguous and identified concepts. Still, I'm not sure there's really a problem per se...

> 1. On per-concept pages, there's a sample of matching images/works.

Concept pages are for identified concepts, this is fine with the ID search

> 2. On per-concept pages, there's a link to a filtered search for the given concept, directly below the sample results.

I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

> 3. On work pages, the list of subjects/contributors/genres to link to filtered searches for each subject.

I wonder if an approach here would be to link to a label search for an unidentified concept, and an ID search for an identified concept. Perhaps there could be UI signifiers that we were doing a text search rather than searching for some kind of canonical entity.

> 4. In the works API, there are filters and aggregations for subject/contributor/genre.

Think this is implied by the above 3 requirements.

> 5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

As with 3 but agreed it isn't necessarily obvious.

> 6. The catalogues are the source of truth for subject identifiers.

For sure.

Appendix

I think I might have talked myself into (via responses to 3 and 5) some kind of concept filter that accepts IDs and labels? Is that an unacceptable deviation? It would behave much like the existing works search when searching for an ID.

Contributor Author

> I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

Yeah, I can make that clearer. It's the "there are 5 images above, click here to see all the images" button.

I wanted to state 4 explicitly for the "API-as-product" requirement.

How do dual filters work in the front-end? Does the front-end combine aggregations for identified/unidentified subjects and present them as one list, and then different ticky boxes apply different filters?
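
One way the combined list could work, purely as a sketch: the bucket shapes (`id`, `label`, `count`) and the `subjects.label` parameter name are assumptions, and only `subjects` as the ID filter name comes from the RFC. Each option in the merged list remembers which filter it would apply.

```python
from typing import Dict, List


def merge_subject_options(
    id_buckets: List[Dict], label_buckets: List[Dict]
) -> List[Dict]:
    """Combine ID and label buckets into one list of filter options."""
    # Identified subjects filter by canonical ID.
    options = [
        {"text": b["label"], "count": b["count"],
         "param": "subjects", "value": b["id"]}
        for b in id_buckets
    ]
    # Unidentified subjects filter by label ("subjects.label" is assumed).
    options += [
        {"text": b["label"], "count": b["count"],
         "param": "subjects.label", "value": b["label"]}
        for b in label_buckets
    ]
    # Most common first, regardless of which kind of filter it is.
    return sorted(options, key=lambda o: o["count"], reverse=True)
```

With this shape, two options can share the same display text while applying different filters, which is exactly the case the UI would need to make visible.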

The idea of a filter that does both is interesting, and makes me think of another question: do we need a way to find specifically unidentified subjects? e.g. if I'm on a work page with an unidentified subject "Mental health", is it okay to link to a search that includes the LCSH-identified subject "Mental health" as well as unidentified subjects?

Contributor

I suppose it would be better named a "concepts query" rather than a "concepts filter" in this case. I'm going off it already tbh - I'm not sure how the aggregations would work, although I think that would be the expected behaviour in this case.

Re search for unidentified concept labels returning identified concepts, I think that's fine.

Contributor

I agree with the general gist of what Jamie is saying above. Concept pages are for identified concepts, if it's not identified we don't give it a page. We tend towards more identified concepts over time (and it is the vast majority already, genres aside). For genres that means, no concept pages but a continuation of what we have today.

The caveat to the above is the potential mixing of user flows. If all genres do one thing and all subjects do another, that's probably ok. If it's less deterministic than that we may have a problem.

I would like @GarethOrmerod to be involved in this conversation, as I think we probably do need to treat linking to concept pages and back to searches for labels differently when we link to them from works pages, to make sure it's clear to users where they're going to go in each case.

Are there any missing or unnecessary requirements?

* How many identified/unidentified concepts are there in the catalogue?
Contributor

@harrisonpim may also want to weigh in here.

Contributor Author

I went and had a look in a recent snapshot.

| | unique values | total occurrences |
|---|---|---|
| unidentified subjects; label only | 120,301 | 556,331 |
| identified subjects; canonical ID | 104,481 | 1,540,795 |
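
Using the snapshot counts above, the identified share can be worked out two ways: by unique subject and by occurrence. A short sketch of the arithmetic (the variable names are illustrative):

```python
def share(part: int, other: int) -> float:
    """Fraction that `part` makes up of `part + other`."""
    return part / (part + other)


# Counts copied from the snapshot figures above.
identified_unique, unidentified_unique = 104_481, 120_301
identified_total, unidentified_total = 1_540_795, 556_331

unique_share = share(identified_unique, unidentified_unique)
occurrence_share = share(identified_total, unidentified_total)

print(f"{unique_share:.1%} of unique subjects are identified")
print(f"{occurrence_share:.1%} of subject occurrences are identified")
```

This gives roughly 46.5% of unique subjects but about 73.5% of occurrences, so identified subjects are a minority of distinct values but account for most of the usage.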


* How do we want to approach this filtering?

## See also

* [RFC 008](../008-api-filtering): API Filtering
* [RFC 037](../037-api-faceting-principles): API faceting principles & expectations
21 changes: 21 additions & 0 deletions rfcs/README.md
@@ -13,6 +13,20 @@ When an RFC is merged it provides a guide to implementing that change when it is
## Table of contents

<dl>
<dt>
<a href="./008-api-filtering">RFC 008</a>: API Filtering
</dt>
<dd>
Defining a set of patterns for filtering and sorting in the catalogue API.
</dd>

<dt>
<a href="./037-api-faceting-principles">RFC 037</a>: API faceting principles & expectations
</dt>
<dd>
Standards for filtering and aggregations in the catalogue API, including naming and response types.
</dd>

<dt>
<a href="./047-catalogue-api-index-structure">RFC 047</a>: Changing the structure of the Catalogue API index
</dt>
@@ -33,4 +47,11 @@ When an RFC is merged it provides a guide to implementing that change when it is
<dd>
Some discussion about how we might model subjects and people in the concepts API.
</dd>

<dt>
<a href="./053-filtering-by-subject">RFC 053</a>: Filtering by contributor, genre, and subject
</dt>
<dd>
How we'll support filtering for per-subject pages as part of the concepts work.
</dd>
</dl>