Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with duplicate subject labels in aggregations #5563

Closed
alexwlchan opened this issue Jun 22, 2022 · 5 comments
Closed

Fix issue with duplicate subject labels in aggregations #5563

alexwlchan opened this issue Jun 22, 2022 · 5 comments
Assignees

Comments

@alexwlchan
Copy link
Contributor

Consider the following API request: https://api.wellcomecollection.org/catalogue/v2/images?source.subjects.label=Horses&aggregations=source.subjects.label

This is the aggregation:

        {
          "data": {
            "id": "zssqcytq",
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 364,
          "type": "AggregationBucket"
        },
        {
          "data": {
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 150,
          "type": "AggregationBucket"
        },
        ...
        {
          "data": {
            "id": "jrnfsbrf",
            "label": "Horses",
            "concepts": [],
            "type": "Subject"
          },
          "count": 12,
          "type": "AggregationBucket"
        },

Because the phrase "Horses" appears three times, it appears as three different entries in the front-end filter; consider https://www-stage.wellcomecollection.org/images?source.subjects.label=Horses

For aggregations it's probably sufficient to bin the IDs when we're aggregating by label, but this might point to a broader issue for subjects later on.

@alexwlchan
Copy link
Contributor Author

@jtweed
Copy link
Contributor

jtweed commented Jun 23, 2022

Yeah this is a definite bug, at least in the "Do What I Mean" sense. Aggregating by label should do just that, it shouldn't also take IDs into account.

I guess we just pick the first one for returning in the aggregation, as to then go on and filter by label should return results with any of the IDs.

Once we filter by IDs, then yes we probably need to aggregate by IDs too. Concept merging should hopefully fix this problem, though it would be good to confirm why this is occurring. Are these concepts from multiple sources or are we giving an identity to strings without de-duping?

@alexwlchan
Copy link
Contributor Author

alexwlchan commented Jun 23, 2022

I guess we just pick the first one for returning in the aggregation, as to then go on and filter by label should return results with any of the IDs.

I think we should take a slightly different approach, and remove the identifiers for the purpose of aggregations – so in this case we'd have a single, unidentified aggregation with 364 + 150 + 12 = 526 entries for "horse". This is what we used to have, and it got broken in some recent index restructuring.

Are these concepts from multiple sources or are we giving an identity to strings without de-duping?

Multiple sources.

Work IDSierra IDMARC source value
bvt9jdzf b2498582x650 0 Horses.|0sh 85062160 (650 ind 2 = 0 ⇒ LCSH)
y9dyqe8mb11318430650 2 Horses.|0D006736 (650 ind 2 = 2 ⇒ MeSH)
uzen4bbab15439306650 2 Horses.

@jtweed
Copy link
Contributor

jtweed commented Jun 24, 2022

That sounds like a much better approach, then when we have identified filters and aggregations those can return the identified concept.

Interesting example, thanks. Shows that as part of wellcomecollection/docs#83 we do need to match and merge concepts.

@alexwlchan
Copy link
Contributor Author

Fixed by the latest reindex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants