RFC 053: Filtering by contributor, genre, and subject #82

Open
wants to merge 4 commits into
base: main
Changes from 2 commits
6 changes: 3 additions & 3 deletions rfcs/037-api-faceting-principles/README.md
@@ -1,4 +1,4 @@
-# API faceting principles & expectations
+# RFC 037: API faceting principles & expectations

**Status:** Draft

@@ -63,7 +63,7 @@ and an aggregation on the labels would be:
http://host.name/path/docs?aggregations=a.b.label
```

-**3. Aggregations are returned in an `aggregations` field, with the same name by which they were requested**
+**3. Aggregations are returned in an `aggregations` field, with the same name by which they were requested**

This means JSON paths are still represented as strings, rather than being expanded. For example, the response to the previous example would include at the top level

@@ -139,7 +139,7 @@ But if a separate (non-paired) filter was applied that happened to exclude the `

**6. When a filter and its paired aggregation are both applied, the bucket corresponding to the filtered value is always present**

-Explicitly: even if other filters or queries are present which cause a bucket which currently has an applied filter to be empty (ie, it has a count of 0), it still appears in the aggregation. This is necessary so that the interface for the filter can still be rendered.
+Explicitly: even if other filters or queries are present which cause a bucket which currently has an applied filter to be empty (ie, it has a count of 0), it still appears in the aggregation. This is necessary so that the interface for the filter can still be rendered.
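
As a rough illustration of principle 6, the sketch below pads a list of aggregation buckets so that every value with an applied paired filter still appears, with a count of 0 when nothing matched. The function name `with_filtered_buckets` and the bucket shape (`{"value": ..., "count": ...}`) are assumptions made for this example, not the API's actual response model.

```python
def with_filtered_buckets(buckets, filtered_values):
    """Return the buckets, adding a zero-count bucket for any filtered value that is missing.

    `buckets` is a list of {"value": ..., "count": ...} dicts (a hypothetical shape);
    `filtered_values` are the values currently applied through the paired filter.
    """
    present = {bucket["value"] for bucket in buckets}
    padded = list(buckets)  # copy, so the caller's list is not mutated
    for value in filtered_values:
        if value not in present:
            padded.append({"value": value, "count": 0})
            present.add(value)
    return padded
```

With this shape, a front-end could still render the control for an applied filter even when other filters have emptied its bucket.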

**7. Aggregations on fields contained in sum types return buckets of the type's components**

84 changes: 84 additions & 0 deletions rfcs/053-filtering-by-subject/README.md
@@ -0,0 +1,84 @@
# RFC 053: Filtering by contributor, genre, and subject

For per-subject and per-person pages, we want to filter for images and works that match a given subject/person.
For example:

* Works by Florence Nightingale
* Works about Charles Darwin
* Images about mental health

This turns out to be non-trivial, so this RFC describes how we'll get there.

Strictly speaking we only need subjects and contributors for this work, but genres are so similar we should treat them in the same way.

## Requirements

1. On per-concept pages, there's a sample of matching images/works.

2. On per-concept pages, there's a link to a filtered search for the given concept, directly below the sample results.

3. On work pages, each entry in the list of subjects/contributors/genres links to a filtered search for that subject/contributor/genre.
alexwlchan marked this conversation as resolved.

4. In the works API, there are filters and aggregations for subject/contributor/genre.

5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

6. The catalogues are the source of truth for subject identifiers.
We can find equivalent identifiers, but we can't pick them from scratch.

e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.
Comment on lines +27 to +33
Contributor Author

These requirements might seem oddly specific; I added them as a way to negate some of the more wild ideas I had. (e.g. have "shadow identifiers" for unidentified concepts, but then it's impossible for anyone to work out how to filter!)

They did lead me towards the suggestion in "An idea".

Contributor

I agree that if something is unidentified, we shouldn't make up an id. It should remain unidentified and that may mean that we have to treat it differently, eg the difference between identified and unidentified series on works pages. But it is being true to the data. We are more likely to get into a pickle if we don't go with the grain of what we have.


## Current behaviour

We have filters and aggregations for *label*, not ID.

## Future behaviour

If we add filtering/aggregations for subjects by ID, we already know how they'll be named: `subjects` and `source.subjects` for works and images, respectively.
This is consistent with our existing API design.
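
To make the naming concrete, here is a minimal sketch of how a client might build such a request. It assumes the filter takes a concept ID as its value and that the aggregation is requested through the `aggregations` parameter described in RFC 037; the host `http://host.name`, the function `filtered_search_url`, and the ID `abc123` are placeholders, not real API values.

```python
from urllib.parse import urlencode

# Placeholder host, in the style of the examples in RFC 037.
BASE_URL = "http://host.name"

# Parameter names proposed in this RFC for filtering/aggregating subjects by ID.
SUBJECT_PARAM = {"works": "subjects", "images": "source.subjects"}


def filtered_search_url(resource, concept_id, aggregate=True):
    """Build a URL that filters `resource` ("works" or "images") by subject ID.

    When `aggregate` is true, the paired aggregation is requested as well.
    """
    param = SUBJECT_PARAM[resource]
    query = {param: concept_id}
    if aggregate:
        query["aggregations"] = param
    return f"{BASE_URL}/{resource}?{urlencode(query)}"


# "abc123" is a made-up ID used purely for illustration.
print(filtered_search_url("works", "abc123"))
print(filtered_search_url("images", "abc123", aggregate=False))
```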

But do we add filtering/aggregations for subjects by ID?
We find ourselves in a dilemma:

* We can't rely on filtering by label, because distinct concepts can have similar or identical labels,
  e.g. two members of the same family.

* We can't rely on filtering by ID, because not all subjects/genres/contributors are identified.
Contributor

We can filter by ID when we know it is an identified subject. I think I'm with Jamie and the direction he's heading below: concept pages are for identified things. If they aren't identified, they can't have a page.


From an API perspective, we could easily support both, but it's more complicated in the front-end.
How do we choose which filter mechanism to use?
Contributor

By whether the subject is identified or not.

How does an API client choose which filter mechanism to use?
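
One option raised in review is to choose based on whether the concept is identified. The sketch below shows that rule only; it is not a settled design. It assumes a concept is a dict with a `label` and an optional `id`, and that the existing label filter is called `subjects.label`. Both the shape and that parameter name are assumptions for illustration.

```python
def choose_filter(concept, id_param="subjects", label_param="subjects.label"):
    """Return the (parameter, value) pair to filter by for a single concept.

    Identified concepts are filtered by ID; unidentified ones fall back to
    a label filter. The parameter names are hypothetical defaults.
    """
    concept_id = concept.get("id")
    if concept_id:
        return (id_param, concept_id)
    return (label_param, concept["label"])


# An identified and an unidentified subject with the same label
# ("abc123" is a made-up ID).
print(choose_filter({"id": "abc123", "label": "Mental health"}))
print(choose_filter({"label": "Mental health"}))
```

A rule like this keeps the decision in one place, but it leaves open how the two kinds of filter would be presented together in the front-end.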

Questions:

* Are the requirements as stated correct?
Contributor

A general response to the problem stated: it seems to me that the shortcomings of each approach (ID vs label) are kind of related to each other. That is to say - I think the existence of unidentifiable concepts in some sense implies an inability to disambiguate concepts by label.

In reality it's slightly more complex than that - as you say, with family names it's realistic to end up with multiple ambiguous and identified concepts. Still, I'm not sure there's really a problem per se...

> 1. On per-concept pages, there's a sample of matching images/works.

Concept pages are for identified concepts, so this is fine with the ID search.

> 2. On per-concept pages, there's a link to a filtered search for the given concept, directly below the sample results.

I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

> 3. On work pages, the list of subjects/contributors/genres to link to filtered searches for each subject.

I wonder if an approach here would be to link to a label search for an unidentified concept, and an ID search for an identified concept. Perhaps there could be UI signifiers that we were doing a text search rather than searching for some kind of canonical entity.

> 4. In the works API, there are filters and aggregations for subject/contributor/genre.

Think this is implied by the above 3 requirements.

> 5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

As with 3 but agreed it isn't necessarily obvious.

> 6. The catalogues are the source of truth for subject identifiers.

For sure.

Appendix

I think I might have talked myself into (via responses to 3 and 5) some kind of concept filter that accepts IDs and labels? Is that an unacceptable deviation? It would behave much like the existing works search when searching for an ID.

Contributor Author

> I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

Yeah, I can make that clearer. It's the "there are 5 images above, click here to see all the images" button.

I wanted to state 4 explicitly for the "API-as-product" requirement.

How do dual filters work in the front-end? Does the front-end combine aggregations for identified/unidentified subjects and present them as one list, and then different ticky boxes apply different filters?

The idea of a filter that does both is interesting, and makes me think of another question: do we need a way to find specifically unidentified subjects? e.g. if I'm on a work page with an unidentified subject "Mental health", is it okay to link to a search that includes the LCSH-identified subject "Mental health" as well as unidentified subjects?

Contributor

I suppose it would be better named a "concepts query" rather than a "concepts filter" in this case. I'm going off it already tbh - I'm not sure how the aggregations would work, although I think that would be the expected behaviour in this case.

Re search for unidentified concept labels returning identified concepts, I think that's fine.

Contributor

I agree with the general gist of what Jamie is saying above. Concept pages are for identified concepts; if it's not identified, we don't give it a page. We tend towards more identified concepts over time (and it is the vast majority already, genres aside). For genres, that means no concept pages but a continuation of what we have today.

The caveat to the above is the potential mixing of user flows. If all genres do one thing and all subjects do another, that's probably ok. If it's less deterministic than that we may have a problem.

I would like @GarethOrmerod to be involved in this conversation, as I think we probably do need to treat linking to concept pages and linking back to searches for labels differently when we link to them from works pages, to make sure it's clear to users where they're going to go in each case.

Are there any missing or unnecessary requirements?

* How many identified/unidentified concepts are there in the catalogue?
Contributor

@harrisonpim may also want to weigh in here.

Contributor Author

I went and had a look in a recent snapshot.

| | unique values | total occurrences |
|---|---|---|
| unidentified subjects; label only | 120,301 | 556,331 |
| identified subjects; canonical ID | 104,481 | 1,540,795 |
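
To put the snapshot figures above in proportion, the following sketch computes the share of subjects that are identified, both by distinct value and by total occurrences. The numbers are copied from the snapshot; the variable and function names are illustrative only.

```python
# Counts copied from the snapshot figures above.
unidentified = {"unique": 120_301, "occurrences": 556_331}
identified = {"unique": 104_481, "occurrences": 1_540_795}


def identified_share(key):
    """Fraction of subjects that are identified, for the given count type."""
    total = identified[key] + unidentified[key]
    return identified[key] / total


for key in ("unique", "occurrences"):
    print(f"{key}: {identified_share(key):.1%} identified")
```

On these figures, a little under half of the distinct subjects are identified, but they account for roughly three quarters of all subject occurrences.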


* How do we want to approach this filtering?

## Rejected approaches

* Only support filtering by ID, and mint our own identifiers for unidentified subjects.
Then all subjects are identified and we can filter by them.

e.g. such subjects could get a source identifier

```json
{
  "identifierType": {
    "id": "wellcome-catalogue-label",
    "label": "Wellcome catalogue label",
    "type": "IdentifierType"
  },
  "value": "Mental health",
  "type": "Identifier"
}
```

## See also

* [RFC 008](../008-api-filtering): API Filtering
* [RFC 037](../037-api-faceting-principles): API faceting principles & expectations
21 changes: 21 additions & 0 deletions rfcs/README.md
@@ -13,6 +13,20 @@ When an RFC is merged it provides a guide to implementing that change when it is
## Table of contents

<dl>
<dt>
<a href="./008-api-filtering">RFC 008</a>: API Filtering
</dt>
<dd>
Defining a set of patterns for filtering and sorting in the catalogue API.
</dd>

<dt>
<a href="./037-api-faceting-principles">RFC 037</a>: API faceting principles & expectations
</dt>
<dd>
Standards for filtering and aggregations in the catalogue API, including naming and response types.
</dd>

<dt>
<a href="./047-catalogue-api-index-structure">RFC 047</a>: Changing the structure of the Catalogue API index
</dt>
@@ -33,4 +47,11 @@ When an RFC is merged it provides a guide to implementing that change when it is
<dd>
Some discussion about how we might model subjects and people in the concepts API.
</dd>

<dt>
<a href="./053-filtering-by-subject">RFC 053</a>: Filtering by contributor, genre, and subject
</dt>
<dd>
How we'll support filtering for per-subject pages as part of the concepts work.
</dd>
</dl>