Fix issue with duplicate subject labels in aggregations #5563
Comments
You can see the same issue here for works: https://api.wellcomecollection.org/catalogue/v2/works?aggregations=subjects.label&query=horses
Yeah, this is a definite bug, at least in the "Do What I Mean" sense. Aggregating by label should do just that; it shouldn't also take IDs into account. I guess we just pick the first ID to return in the aggregation, since filtering by label afterwards should return results with any of the IDs. Once we filter by IDs, then yes, we probably need to aggregate by IDs too. Concept merging should hopefully fix this problem, though it would be good to confirm why this is occurring. Are these concepts from multiple sources, or are we giving an identity to strings without de-duping?
I think we should take a slightly different approach, and remove the identifiers for the purpose of aggregations – so in this case we'd have a single, unidentified aggregation with 364 + 150 + 12 = 526 entries for "horse". This is what we used to have, and it got broken in some recent index restructuring.
Multiple sources.
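A minimal sketch of the merge proposed above, assuming each aggregation bucket is a dict with `id`, `label` and `count` keys. That shape, the placeholder IDs and the function name `merge_buckets_by_label` are all hypothetical, not the API's actual response format. Summing the counts assumes no single work carries the same label under two different IDs; if one did, it would be counted twice.

```python
from collections import Counter

def merge_buckets_by_label(buckets):
    """Sum bucket counts per label, dropping the concept IDs."""
    totals = Counter()
    for bucket in buckets:
        totals[bucket["label"]] += bucket["count"]
    # most_common() returns labels ordered from largest to smallest count.
    return [{"label": label, "count": count} for label, count in totals.most_common()]

# Hypothetical buckets: one label, three differently identified concepts.
buckets = [
    {"id": "id-a", "label": "Horses", "count": 364},
    {"id": "id-b", "label": "Horses", "count": 150},
    {"id": "id-c", "label": "Horses", "count": 12},
]
merged = merge_buckets_by_label(buckets)
print(merged)
```

With these inputs, the three identified buckets collapse into a single unidentified one.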
That sounds like a much better approach. Then, when we have identified filters and aggregations, those can return the identified concept. Interesting example, thanks. It shows that as part of wellcomecollection/docs#83 we do need to match and merge concepts.
Fixed by the latest reindex.
Consider the following API request: https://api.wellcomecollection.org/catalogue/v2/images?source.subjects.label=Horses&aggregations=source.subjects.label
This is the aggregation:
Because the phrase "Horses" appears three times, it appears as three different entries in the front-end filter; consider https://www-stage.wellcomecollection.org/images?source.subjects.label=Horses
For aggregations it's probably sufficient to bin the IDs when we're aggregating by label, but this might point to a broader issue for subjects later on.
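One way to read "bin the IDs" is to group buckets by label while keeping the list of IDs behind each label, so a later ID-based filter still has them available. The sketch below assumes buckets shaped as dicts with `id`, `label` and `count` keys. The shape, the placeholder IDs and the name `bin_ids_by_label` are illustrative assumptions, not the API's real schema.

```python
from collections import defaultdict

def bin_ids_by_label(buckets):
    """Group buckets by label, collecting the IDs and summing the counts."""
    bins = defaultdict(lambda: {"ids": [], "count": 0})
    for bucket in buckets:
        entry = bins[bucket["label"]]
        entry["ids"].append(bucket["id"])
        entry["count"] += bucket["count"]
    return dict(bins)

# Hypothetical input: "Horses" appears under three different concept IDs.
raw_buckets = [
    {"id": "id-a", "label": "Horses", "count": 364},
    {"id": "id-b", "label": "Horses", "count": 150},
    {"id": "id-c", "label": "Horses", "count": 12},
]
binned = bin_ids_by_label(raw_buckets)
print(binned["Horses"])
```

The front-end filter would then show one "Horses" entry, while the binned IDs remain available if subjects are later filtered by ID.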