After looking at a few well-known UK terms containing diacritics, I had assumed that the UK release centre generally published synonyms with and without diacritics. For example, Sjögrens also has the synonym Sjogrens, making it possible to find the term whether or not the user types the diacritic. As such, the current behaviour is to default to not using a normalised index, requiring explicit opt-in at all levels of the API (Clojure/Java/HTTP).
The underlying assumption behind that decision is generally true, but further analysis shows that >70% of concepts with diacritics in their synonyms do not have an exact matching equivalent synonym without diacritics. This means that a user typing without diacritics may not find a term.
This code can be used to analyse all of the synonyms of all of the concepts, counting these statistics and identifying examples in which this is an issue:
Firstly, this streams all active known concepts. For each concept, it fetches the synonyms, identifies those with diacritics, and performs a search for each of those against the index to see if the original concept is found. The results below are based on the latest UK clinical and drug extension editions as of today.
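As a rough illustration of that analysis (not the code actually used against Hermes), here is a minimal sketch that runs over a toy in-memory map instead of a real index. The names `strip-diacritics`, `has-diacritics?`, `exact-search` and `missing`, the concept ids and the sample synonyms are all hypothetical; only `java.text.Normalizer` and `clojure.string` are real APIs.

```clojure
(require '[clojure.string :as str])
(import '(java.text Normalizer Normalizer$Form))

(defn strip-diacritics
  "Decompose s (NFD) and remove combining marks, e.g. ö -> o."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}" "")))

(defn has-diacritics?
  "True if stripping combining marks changes the string."
  [s]
  (not= s (strip-diacritics s)))

;; Toy data, hypothetical ids: concept id -> synonyms.
(def concepts
  {1 ["Sjögren's syndrome" "Sjogren's syndrome"]
   2 ["Klüver-Bucy syndrome"]})

(defn exact-search
  "Stand-in for an unfolded search: the set of concept ids that have
  a synonym equal to q, ignoring case."
  [concepts q]
  (set (for [[id terms] concepts
             t terms
             :when (= (str/lower-case t) (str/lower-case q))]
         id)))

(defn missing
  "For every synonym containing diacritics, search for its stripped
  form and report the cases in which the owning concept is not found."
  [concepts]
  (for [[id terms] concepts
        t terms
        :when (has-diacritics? t)
        :let [q (strip-diacritics t)]
        :when (not (contains? (exact-search concepts q) id))]
    {:concept-id id :term t :query q}))

(missing concepts)
;; => ({:concept-id 2, :term "Klüver-Bucy syndrome", :query "Kluver-Bucy syndrome"})
```

In the toy data, the first concept is not reported because it also publishes a synonym without the diacritic; the second concept is reported because it does not.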
There are four broad approaches we need to evaluate:
1. Search occurs only against the 'term' field (no normalisation/folding) - the current situation.
2. Search occurs against both the 'term' field and the normalised/folded term.
3. Search occurs only against the normalised/folded index.
4. Search falls back to using the normalised/folded term if and only if there are no matches for (1).
The considerations are:
Users should be able to find the results they expect
We do not adversely affect performance for the majority of searches.
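To make the four approaches concrete, here is a small sketch over a toy in-memory "index". It assumes simple substring matching in place of Lucene, and the function names (`fold`, `search-term-only`, `search-both`, `search-folded-only`, `search-fallback`) are hypothetical rather than part of the Hermes API.

```clojure
(require '[clojure.string :as str])
(import '(java.text Normalizer Normalizer$Form))

(defn fold
  "Lower-case s and strip combining marks after NFD decomposition."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}" "")
      str/lower-case))

;; Toy index: each description keeps its display text, a lower-cased
;; 'term' field and a folded 'nterm' field.
(def descriptions
  (mapv (fn [t] {:display t :term (str/lower-case t) :nterm (fold t)})
        ["Klüver-Bucy syndrome" "Sjögren's syndrome" "Sjogren's syndrome"]))

(defn- matches
  "Descriptions whose given field contains q (substring match)."
  [field q]
  (filter #(str/includes? (get % field) q) descriptions))

;; 1. 'term' field only
(defn search-term-only [s] (matches :term (str/lower-case s)))

;; 2. both fields, de-duplicated
(defn search-both [s]
  (distinct (concat (matches :term (str/lower-case s))
                    (matches :nterm (fold s)))))

;; 3. folded field only
(defn search-folded-only [s] (matches :nterm (fold s)))

;; 4. folded field only if the 'term' field finds nothing
(defn search-fallback [s]
  (or (seq (search-term-only s)) (search-folded-only s)))

(map :display (search-term-only "Kluver-Bucy"))   ;; => ()
(map :display (search-folded-only "Kluver-Bucy")) ;; => ("Klüver-Bucy syndrome")
(map :display (search-fallback "Kluver-Bucy"))    ;; => ("Klüver-Bucy syndrome")
(map :display (search-both "sjogren"))
;; => ("Sjogren's syndrome" "Sjögren's syndrome")
```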
Option 1. Search against only 'term' field by default
This is the current situation. In order to search against the folded/normalized index, clients must explicitly pass in a 'fold' parameter. The results shown are for when clients do not choose that fold parameter.
Here we see that only 188 concepts have a term with a diacritic. Of all of the synonyms for these 188 concepts, 193 are considered 'missing' by this analysis. For example, if we search for Kluver-Bucy syndrome without diacritics, we get no result unless we explicitly ask to search the folded index:
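Importantly, this baseline approach is very fast:
A search returns in 44 microseconds.
Option 2. Search by term and folded term by default
In this option, we always search both indices.
This provides impressive results with no missing terms. A search typed without diacritics still finds terms that contain them. Let's show an example:
We've typed our term without diacritics, and it has returned a result with diacritics.
However, this comes at a ~40% performance cost:
This is expected, as Lucene is searching for our text string across two fields.
In addition, we now have two inverted indexes, creating a small increase in overall database size.
Option 3: Search folded index only
Here we need to ensure our search term is appropriately folded for the language required.
There are no search terms with missing results.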
(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 1194300 in 60 samples of 19905 calls.
Execution time mean : 50.357447 µs
Execution time std-deviation : 237.716621 ns
Execution time lower quantile : 50.147748 µs ( 2.5%)
Execution time upper quantile : 50.770990 µs (97.5%)
Overhead used : 1.911075 ns
Found 6 outliers in 60 samples (10.0000 %)
low-severe 4 (6.6667 %)
low-mild 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
There's a small performance hit (12%). In addition, if we type a search term with diacritics, then unless we carefully manage excluded characters, we may return false positive results. This may not matter much in the UK, but certainly does affect other languages. Further analysis may be required in order to determine whether blanket normalisation might have unintended consequences.
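As a sketch of what "managing excluded characters" could look like, the function below folds every character except those in a caller-supplied exclusion set. `fold-excluding` is a hypothetical helper, not the Hermes or Lucene implementation, and the Swedish exclusion set is an illustrative assumption (å, ä and ö are distinct letters in Swedish, so folding them would conflate different words).

```clojure
(require '[clojure.string :as str])
(import '(java.text Normalizer Normalizer$Form))

(defn fold-excluding
  "Strip diacritics from s, leaving any character in the set
  `excluded` untouched. The input is first composed (NFC) so that
  excluded characters can be recognised as single characters."
  [excluded s]
  (->> (Normalizer/normalize s Normalizer$Form/NFC)
       (map (fn [ch]
              (if (contains? excluded ch)
                (str ch)
                (-> (Normalizer/normalize (str ch) Normalizer$Form/NFD)
                    (str/replace #"\p{M}" "")))))
       (apply str)))

;; Illustrative exclusion set for Swedish (an assumption for this sketch).
(def swedish-excluded (set "åäöÅÄÖ"))

(fold-excluding #{} "Sjögren")                      ;; => "Sjogren"
(fold-excluding swedish-excluded "Sjögren")         ;; => "Sjögren"
(fold-excluding swedish-excluded "Ménière Sjögren") ;; => "Meniere Sjögren"
```

With an empty exclusion set this behaves as blanket folding; with a language-specific set, characters that carry meaning in that language survive and so cannot produce false positive matches.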
Option 4: Fallback to folded index if no results
In this option, we search our normal index FIRST, and fall back to using the folded index if and only if there are no results. The danger here is that we don't return results when we should. We have an issue with 3 out of the 3083444 descriptions within SNOMED CT.
We should expect performance metrics to be the same as option 1, unless there is a need to fall back.
(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 1364340 in 60 samples of 22739 calls.
Execution time mean : 44.225304 µs
Execution time std-deviation : 289.493260 ns
Execution time lower quantile : 43.922910 µs ( 2.5%)
Execution time upper quantile : 44.998974 µs (97.5%)
Overhead used : 1.911075 ns
When we have to fall back, performance takes a hit. This affects a tiny proportion of searches, of course.
(crit/bench (search svc {:s "Kluver-Bucy syndrome" :max-hits 500}))
Evaluation count : 511320 in 60 samples of 8522 calls.
Execution time mean : 116.728221 µs
Execution time std-deviation : 913.031619 ns
Execution time lower quantile : 115.979940 µs ( 2.5%)
Execution time upper quantile : 118.438182 µs (97.5%)
Overhead used : 1.911075 ns
The 3 missing results arise because a search that does not use the folded index returns results that do not match the original concepts. That is because these three exceptions now have inactive descriptions.
For example, a search for Kienbock's disease returns results with non-diacritic characters. Actually, concept 84062004 is now called "Juvenile osteochondrosis of carpal lunate" and the synonym "Kienbock's" has been made inactive. Concept 787484007 is "Progressive avascular necrosis of lunate" and is distinct from 84062004.
(search svc {:s "Kienbock's disease"})
=>
(#com.eldrix.hermes.snomed.Result{:id 139392017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 505902017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease - osteochondritis of carpal lunate",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"})
While this only affects 3 results, the search results for these terms are surprising. The only way to get a result is to use a diacritic or to use the folded index:
(search svc {:s "Kienbock's disease" :fold true})
=>
(#com.eldrix.hermes.snomed.Result{:id 3775830017,
                                  :conceptId 787484007,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Progressive avascular necrosis of lunate"}
 #com.eldrix.hermes.snomed.Result{:id 139391012,
                                  :conceptId 84062004,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 139392017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 505902017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease - osteochondritis of carpal lunate",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"})
But now we get both concepts, and the user can choose based on the modern preferred term. That suggests a fallback approach can lead to unexpected results.
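The reason the fallback hides concept 787484007 can be reproduced with a toy model. The sketch below reuses the four descriptions from the results above, but the substring matching and the function names (`fold`, `search-unfolded`, `search-folded`, `search-fallback`) are hypothetical stand-ins for the real index, not the Hermes implementation.

```clojure
(require '[clojure.string :as str])
(import '(java.text Normalizer Normalizer$Form))

(defn fold
  "Lower-case s and strip combining marks after NFD decomposition."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}" "")
      str/lower-case))

;; The four descriptions taken from the search results above.
(def descriptions
  [{:id 3775830017 :concept-id 787484007 :term "Kienböck's disease"}
   {:id 139391012  :concept-id 84062004  :term "Kienböck's disease"}
   {:id 139392017  :concept-id 84062004  :term "Kienbock's disease"}
   {:id 505902017  :concept-id 84062004
    :term "Kienbock's disease - osteochondritis of carpal lunate"}])

(defn search-unfolded
  "Substring match against the lower-cased original term."
  [q]
  (filter #(str/includes? (str/lower-case (:term %)) (str/lower-case q))
          descriptions))

(defn search-folded
  "Substring match against the folded term, using a folded query."
  [q]
  (filter #(str/includes? (fold (:term %)) (fold q)) descriptions))

(defn search-fallback
  "Use the folded search only when the unfolded search finds nothing."
  [q]
  (or (seq (search-unfolded q)) (search-folded q)))

(set (map :concept-id (search-fallback "Kienbock's disease")))
;; => #{84062004}
(set (map :concept-id (search-folded "Kienbock's disease")))
;; => #{84062004 787484007}
```

Because the first pass finds something, the second pass never runs, so the concept that is only reachable through the diacritic form is never offered to the user.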
Discussion
The current default is difficult to defend. Client applications need to explicitly request 'fold' to increase the sensitivity of text searches containing diacritics. However, Hermes is live in a number of clinical environments and returning excellent results with high performance. It would be inappropriate to increase sensitivity at the cost of specificity, particularly if that affects users of Hermes working in other languages, in which diacritics can affect the semantics and should therefore be excluded from folding. We can mitigate that by using excluded characters and similar heuristics based on language preferences. Simply using both the normal and folded indices has a difficult-to-justify performance impact. Using a fallback provides good results and does not change current behaviour, but the analysis has identified a very small number of problem concepts.
As such, it seems best to perform search using only the folded index.
This handles other languages, because different characters can be excluded from normalisation on a case-by-case basis, and we should bias towards high sensitivity rather than high specificity. After all, clinical users presented with a pick list will be able to choose, whereas if the option isn't even shown, no choice is possible. Using two indices is redundant when the normalised index will have greater sensitivity, and wastes disk space on two inverted indices. The fallback approach leads to incorrect results: sometimes better results would have been available on the second pass, but because some poor results were obtained on the first pass, the second pass against the normalised index doesn't get the chance to contribute results. It also needs two inverted indices, with the same waste of space.
This does have a very small performance impact, but this is better than missing results or returning less good results for a given search. It is likely that some optimisation might be possible to reduce this impact to a minimum.
The use of only a normalised index means that the unadulterated original term text does not need to be indexed, saving space (albeit only 46 MB for a typical distribution).