Fix issue with duplicate subject labels in aggregations #5563

alexwlchan · 2022-06-22T14:48:14Z

Consider the following API request: https://api.wellcomecollection.org/catalogue/v2/images?source.subjects.label=Horses&aggregations=source.subjects.label

This is the aggregation:

        {
          "data": {
            "id": "zssqcytq",
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 364,
          "type": "AggregationBucket"
        },
        {
          "data": {
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 150,
          "type": "AggregationBucket"
        },
        ...
        {
          "data": {
            "id": "jrnfsbrf",
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 12,
          "type": "AggregationBucket"
        },

Because the label "Horses" appears in three separate buckets, it shows up as three different entries in the front-end filter; see https://www-stage.wellcomecollection.org/images?source.subjects.label=Horses

For aggregations it's probably sufficient to bin the IDs when we're aggregating by label, but this might point to a broader issue for subjects later on.

alexwlchan · 2022-06-23T06:02:12Z

You can see the same issue here for works: https://api.wellcomecollection.org/catalogue/v2/works?aggregations=subjects.label&query=horses

jtweed · 2022-06-23T16:08:33Z

Yeah, this is a definite bug, at least in the "Do What I Mean" sense. Aggregating by label should do just that; it shouldn't also take IDs into account.

I guess we just pick the first one to return in the aggregation, since filtering by label afterwards should return results with any of the IDs.

Once we filter by IDs, then yes we probably need to aggregate by IDs too. Concept merging should hopefully fix this problem, though it would be good to confirm why this is occurring. Are these concepts from multiple sources or are we giving an identity to strings without de-duping?

alexwlchan · 2022-06-23T20:18:24Z

> I guess we just pick the first one to return in the aggregation, since filtering by label afterwards should return results with any of the IDs.

I think we should take a slightly different approach, and remove the identifiers for the purpose of aggregations – so in this case we'd have a single, unidentified aggregation with 364 + 150 + 12 = 526 entries for "Horses". This is what we used to have, and it got broken in some recent index restructuring.
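A minimal sketch of what "remove the identifiers" means for the three buckets above, assuming the bucket shape shown in the API response. The function name `merge_buckets_by_label` is hypothetical, and this merges the buckets after the fact purely for illustration; it is not how the pipeline does it.

```python
def merge_buckets_by_label(buckets):
    """Collapse buckets that share a label into one unidentified bucket.

    Hypothetical helper: it assumes each bucket looks like the
    AggregationBucket objects in the API response quoted in this issue.
    """
    merged = {}
    for bucket in buckets:
        label = bucket["data"]["label"]
        if label not in merged:
            # Copy the subject data, dropping the "id" so the bucket is unidentified.
            data = {k: v for k, v in bucket["data"].items() if k != "id"}
            merged[label] = {"data": data, "count": 0, "type": bucket["type"]}
        merged[label]["count"] += bucket["count"]
    # Largest buckets first.
    return sorted(merged.values(), key=lambda b: b["count"], reverse=True)


def horses_bucket(count, subject_id=None):
    """Build a bucket shaped like the ones in the response above."""
    data = {"label": "Horses", "concepts": [], "type": "Subject"}
    if subject_id is not None:
        data["id"] = subject_id
    return {"data": data, "count": count, "type": "AggregationBucket"}


buckets = [
    horses_bucket(364, "zssqcytq"),
    horses_bucket(150),
    horses_bucket(12, "jrnfsbrf"),
]

result = merge_buckets_by_label(buckets)
print(result[0]["count"])  # 526
```

The merged list has a single "Horses" bucket with no `id`, so the front-end filter would show one entry.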

> Are these concepts from multiple sources or are we giving an identity to strings without de-duping?

Multiple sources.

| Work ID | Sierra ID | MARC source value |
| --- | --- | --- |
| bvt9jdzf | b2498582x | `650 0 Horses.\|0sh 85062160` (650 ind 2 = 0 ⇒ LCSH) |
| y9dyqe8m | b11318430 | `650 2 Horses.\|0D006736` (650 ind 2 = 2 ⇒ MeSH) |
| uzen4bba | b15439306 | `650 2 Horses.` |
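As a rough illustration of why one label ends up with three identities, here is a sketch that derives a source identifier from the second indicator and subfield `0` of each row in the table above. The names `SCHEME_BY_IND2` and `subject_identifier` are hypothetical, and the mapping only covers the two indicator values shown in the table; the real transformer logic may differ.

```python
# Assumption: only the two indicator values from the table are covered.
SCHEME_BY_IND2 = {"0": "LCSH", "2": "MeSH"}


def subject_identifier(ind2, subfield_0=None):
    """Return (scheme, value) for a 650 subject, or None if it is unidentified."""
    scheme = SCHEME_BY_IND2.get(ind2)
    if scheme is None or not subfield_0:
        return None
    return (scheme, subfield_0.strip())


# (Sierra ID, 650 second indicator, subfield 0) taken from the table.
rows = [
    ("b2498582x", "0", "sh 85062160"),
    ("b11318430", "2", "D006736"),
    ("b15439306", "2", None),
]

identities = {
    sierra_id: subject_identifier(ind2, sub0) for sierra_id, ind2, sub0 in rows
}

for sierra_id, identity in identities.items():
    print(sierra_id, identity)
```

All three rows carry the label "Horses", but they come out as an LCSH identifier, a MeSH identifier, and no identifier at all, which is consistent with the three separate buckets in the aggregation.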

jtweed · 2022-06-24T11:34:31Z

That sounds like a much better approach; then, when we have identified filters and aggregations, those can return the identified concept.

Interesting example, thanks. Shows that as part of wellcomecollection/docs#83 we do need to match and merge concepts.

alexwlchan · 2022-07-05T10:15:45Z

Fixed by the latest reindex.

alexwlchan added the 📚Catalogue label Jun 22, 2022

alexwlchan mentioned this issue Jun 24, 2022

Remove ids when aggregating by subject wellcomecollection/catalogue-pipeline#2137

Merged

alexwlchan self-assigned this Jun 24, 2022

jtweed mentioned this issue Jun 24, 2022

052 concepts pipeline wellcomecollection/docs#83

Merged

alexwlchan closed this as completed Jul 5, 2022
