Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore previous sorting method #65

Closed
gaurav opened this issue Jun 22, 2023 · 3 comments
Closed

Restore previous sorting method #65

gaurav opened this issue Jun 22, 2023 · 3 comments

Comments

@gaurav
Copy link
Contributor

gaurav commented Jun 22, 2023

May fix #50

Need to restore old sort from:

query = f"http://{SOLR_HOST}:{SOLR_PORT}/solr/name_lookup/select"
params = {
"query": name_filters,
"limit": 0,
"sort": "length ASC",
"facet": {
"categories": {
"type": "terms",
"field": "curie",
"sort": "x asc",
"offset": offset,
"limit": limit,
"facet": {
"x": "min(length)",
},
"numBuckets": True,
}
}
}

@cbizon
Copy link
Contributor

cbizon commented Jun 23, 2023

Had this discussion in slack, but putting here as well:

Looking. back at v 1.2, I think that what happens is that the first query does some sorting, but it doesn't really matter because the second query is the one that goes at gets all the names for the curies found in query 1:
7:41
{
"query": f"({curie_filter}) AND ({name_filters})",
"limit": 1000000,
"sort": "length ASC",
"fields": "curie,name",
}
7:42
So now you have this sorted by length of the name, but with the curie on mutliple rows and all the curies intermixed together.
So then the results get parsed out like this:
output = defaultdict(list)
for doc in response.json()["response"]["docs"]:
output[doc["curie"]].append(doc["name"])
(edited)
7:44
And this output dict is magically preserving the order. curie with the shortest doc gets added first. Now the sorting of the first query does actually matter, because you want to get the curies that have the shortest name first.
7:45
So for us, either we need to add another parameter on the doc which is the length of the shortest element in the synonym list, or we need a way to figure it out in the query

Chris Bizon
8:02 PM
I think we could do it on the query with https://solr.apache.org/guide/7_6/function-queries.html#sort-by-function, we just need to figure out the right function. I think it's the length of the first synonym

@gaurav gaurav assigned gaurav and unassigned cbizon Jun 27, 2023
@gaurav
Copy link
Contributor Author

gaurav commented Jun 27, 2023

So for us, either we need to add another parameter on the doc which is the length of the shortest element in the synonym list, or we need a way to figure it out in the query

I’ve been looking in https://solr.apache.org/guide/7_6/function-queries.html and I don’t see a length function. I tried Googling for it and couldn’t find anything either (apart from ChatGPT which helpfully suggested the field_length() function that nobody else has heard of). I think you’re right that probably the best short-term solution is to reinstate the length field in Solr that we used to have previously. I’m going to call this shortest_name_length for now so it’s clear what it means, but I wonder if we should try to factor the information content value in here as well. I should be able to implement the shortest_name_length field in Babel pretty quickly, so I'll take back ownership of this ticket.

gaurav added a commit that referenced this issue Jul 6, 2023
This PR attempts to improve the search query issues we've been having by restoring the "sort by shortest synonym" rule as per #65 (comment). This required adding `shortest_name_length` to the Solr field configuration and to tests/data/test-synonyms.json. I haven't confirmed that this works as well as before, only that it seems to be working (I'll add those tests to the Babel Validator).

It also improves the documentation of the endpoints, fixes the version of nameresolution-data-loading to `latest`, and update `preferred_name` to LowerTextField so we can search for it using the same features as we do on the `names` field. We also add `http://localhost:8080` as a server for development.
@gaurav gaurav added this to the NameRes October 2023 milestone Sep 27, 2023
@gaurav
Copy link
Contributor Author

gaurav commented Dec 3, 2023

This has now been significantly improved with CURIE suffix sorting, and it's working well enough that is what is being used by Translator UI. Closing.

@gaurav gaurav closed this as completed Dec 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants