Search should default to including normalised (folded) index #57

wardle · 2023-09-08T17:15:13Z

After looking at a few well-known UK terms containing diacritics, I had assumed that the UK release centre generally published synonyms both with and without diacritics. For example, Sjögrens also has the synonym Sjogrens, making it possible to find the term whether the user types the diacritic or not. As such, the current behaviour is to default to not using a normalised index, requiring explicit opt-in at all levels of the API (Clojure/Java/HTTP).

The underlying assumption behind that decision is generally true, but further analysis shows that >70% of concepts with diacritics in their synonyms do not have an exact matching equivalent synonym without diacritics. This means that a user typing without diacritics may not find a term.

This code can be used to analyse all of the synonyms of all of the concepts, counting these statistics and identifying examples in which this is an issue:

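;; Aliases assumed: a = clojure.core.async, set = clojure.set; lang/fold,
;; stream-all-concepts, synonyms, search and are-any? are Hermes functions.
(defn ^:private analyse-diacritics
  "Count concepts whose synonyms change when folded, and collect the folded
  forms for which the default search does not lead back to the concept."
  [svc]
  (let [ch (a/chan 1 (filter :active))]                ; only active concepts
    (a/thread (stream-all-concepts svc ch))
    (loop [n-concepts 0, missing 0, results []]
      (if-let [concept (a/<!! ch)]
        (let [s1 (set (map :term (synonyms svc (:id concept))))  ; original terms
              s2 (set (map #(lang/fold "en" %) s1))              ; folded terms
              diff (set/difference s2 s1)   ; folded forms not published as synonyms
              ;; keep the folded forms for which are-any? finds no search hit
              ;; matching this concept
              diff' (remove #(are-any? svc (set (map :conceptId (search svc {:s %}))) [(:id concept)]) diff)]
          (recur (if (seq diff) (inc n-concepts) n-concepts)
                 (+ missing (count diff'))
                 (if (seq diff') (conj results {:concept-id (:id concept) :missing diff'}) results)))
        ;; channel closed: all concepts processed
        {:n-concepts n-concepts :missing missing :results results}))))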

Firstly, this streams all active known concepts. For each concept, it fetches the synonyms, folds them, and keeps the folded forms that are not already published as synonyms. It then searches the index for each of those folded forms to see whether the original concept is found. The results below are based on the latest UK clinical and drug extension editions as of today.
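To make the folding step concrete, here is a minimal sketch of the "fold, then diff" idea on plain strings. It does not use Hermes: the `fold` function below is an illustrative stand-in built on `java.text.Normalizer` (decompose, then strip combining marks), and `unfolded-gaps` is a hypothetical helper name. Hermes' own `lang/fold` is language-aware and may behave differently.

```clojure
(ns example.fold
  (:require [clojure.set :as set]
            [clojure.string :as str])
  (:import (java.text Normalizer Normalizer$Form)))

(defn fold
  "Illustrative folding only: decompose to NFD, then strip combining marks.
  This is an assumption for the sketch, not the Hermes implementation."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")))

(defn unfolded-gaps
  "Folded forms of `terms` that are not themselves present in `terms`.
  Mirrors the s1/s2/diff step of the analysis function."
  [terms]
  (let [s1 (set terms)
        s2 (set (map fold s1))]
    (set/difference s2 s1)))

;; A concept that publishes both spellings has no gap:
(unfolded-gaps ["Sjögrens" "Sjogrens"])      ;; => #{}

;; A concept that publishes only the diacritic form does:
(unfolded-gaps ["Klüver-Bucy syndrome"])     ;; => #{"Kluver-Bucy syndrome"}
```

The second case is the kind of synonym the analysis then searches for, to check whether a user typing without diacritics would still reach the concept.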

There are four broad approaches we need to evaluate:

1. Search occurs only against the 'term' field (no normalisation/folding) - the current situation.
2. Search occurs against the 'term' and the normalised/folded term.
3. Search occurs against only the normalised/folded index.
4. Search falls back to using the normalised/folded term iff there are no matches for (1).

The considerations are:

Users should be able to find the results they expect
We do not adversely affect performance for the majority of searches.

Option 1: Search against only 'term' field by default

This is the current situation. In order to search against the folded/normalized index, clients must explicitly pass in a 'fold' parameter. The results shown are for when clients do not choose that fold parameter.

(analyse-diacritics svc)
=>
{:n-concepts 188,
 :missing 193,
 :results [{:concept-id 9638002, :missing ("Dejerine's syndrome II")}
           {:concept-id 10651001, :missing ("Kluver-Bucy syndrome")}
           {:concept-id 13445001, :missing ("Meniere's syndrome, NOS" "Meniere's syndrome")}
           {:concept-id 19447003, :missing ("Structure of colonic crypt of Lieberkuhn")}
           {:concept-id 21512007, :missing ("Structure of appendiceal crypt of Lieberkuhn")}
           {:concept-id 29307005, :missing ("Luckenschadel")}

Here we see that only 188 concepts have a term with a diacritic. Of all of the synonyms for these 188 concepts, 193 are considered 'missing' by this analysis. For example, if we search for Kluver-Bucy syndrome without diacritics, we get no result unless we explicitly ask to search the folded index:

(search svc {:s "Kluver-Bucy syndrome"})
=> nil
(search svc {:s "Kluver-Bucy syndrome" :fold true})
=>
(#com.eldrix.hermes.snomed.Result{:id 18506014,
                                  :conceptId 10651001,
                                  :term "Klüver-Bucy syndrome",
                                  :preferredTerm "Temporal lobectomy behaviour syndrome"})

Importantly, this baseline approach is very fast:

(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 1344300 in 60 samples of 22405 calls.
             Execution time mean : 44.834638 µs
    Execution time std-deviation : 274.254633 ns
   Execution time lower quantile : 44.645308 µs ( 2.5%)
   Execution time upper quantile : 45.125480 µs (97.5%)
                   Overhead used : 1.891543 ns

A search returns in roughly 45 microseconds on average.

Option 2: Search by term and folded term by default

In this option, we always search both indices.

(analyse-diacritics svc)
=> {:n-concepts 188, :missing 0, :results []}

This provides impressive results with no missing terms. Not using diacritic characters in search still means that terms can be found successfully. Let's show an example:

(search svc {:s "Kluver-Bucy syndrome"})
=>
(#com.eldrix.hermes.snomed.Result{:id 18506014,
                                  :conceptId 10651001,
                                  :term "Klüver-Bucy syndrome",
                                  :preferredTerm "Temporal lobectomy behaviour syndrome"})

We've typed our term without diacritics, and it has returned a result with diacritics.

However, this comes at a ~40% performance cost:

(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 966120 in 60 samples of 16102 calls.
             Execution time mean : 62.420293 µs
    Execution time std-deviation : 558.211577 ns
   Execution time lower quantile : 62.040914 µs ( 2.5%)
   Execution time upper quantile : 63.868466 µs (97.5%)
                   Overhead used : 1.891543 ns

This is expected, as Lucene is searching for our text string across two fields.

In addition, we now have two inverted indexes, creating a small increase in overall database size.

Option 3: Search folded index only

Here we need to ensure our search term is appropriately folded for the language required.

(analyse-diacritics svc)
=> {:n-concepts 188, :missing 0, :results []}

There are no search terms with missing results.

(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 1194300 in 60 samples of 19905 calls.
             Execution time mean : 50.357447 µs
    Execution time std-deviation : 237.716621 ns
   Execution time lower quantile : 50.147748 µs ( 2.5%)
   Execution time upper quantile : 50.770990 µs (97.5%)
                   Overhead used : 1.911075 ns

Found 6 outliers in 60 samples (10.0000 %)
	low-severe	 4 (6.6667 %)
	low-mild	 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers

There's a small performance hit (12%). In addition, if we type a search term with diacritics, then unless we carefully manage excluded characters, we may return false positive results. This may not matter much in the UK, but certainly does affect other languages. Further analysis may be required in order to determine whether blanket normalisation might have unintended consequences.
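As a rough illustration of what "managing excluded characters" could look like, the sketch below folds a string character by character but leaves a per-language set of characters untouched. Everything here is an assumption made for illustration: the `excluded` map, its contents, and the `fold` signature are hypothetical and are not taken from Hermes or Lucene.

```clojure
(ns example.exclusions
  (:require [clojure.string :as str])
  (:import (java.text Normalizer Normalizer$Form)))

(def excluded
  "Hypothetical per-language characters that must NOT be folded."
  {"en" #{}
   "sv" (set "åäöÅÄÖ")})   ; in Swedish these are distinct letters

(defn- fold-char
  "Strip combining marks from a single character (illustrative only)."
  [c]
  (-> (Normalizer/normalize (str c) Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")))

(defn fold
  "Fold `s` for `lang`, leaving that language's excluded characters as-is.
  Input is first composed (NFC) so each accented letter is a single char;
  characters outside the Basic Multilingual Plane are not handled specially."
  [lang s]
  (let [keep? (get excluded lang #{})]
    (->> (Normalizer/normalize s Normalizer$Form/NFC)
         (map #(if (keep? %) % (fold-char %)))
         (apply str))))

(fold "en" "Sjögren")   ;; => "Sjogren"
(fold "sv" "Ménière")   ;; => "Meniere"
(fold "sv" "Sjögren")   ;; ö is excluded for "sv", so it is kept
```

With an exclusion set in place, both the indexed term and the query need to be folded with the same language settings, otherwise the two sides will not match.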

Option 4: Fallback to folded index if no results

In this option, we search our normal index first, and fall back to using the folded index if and only if there are no results. The danger here is that we don't return results when we should. We have an issue with 3 out of the 3083444 descriptions within SNOMED CT.

(analyse-diacritics svc)
=>
{:n-concepts 188,
 :missing 3,
 :results [{:concept-id 253828000, :missing ("Aplasia of Mullerian ducts")}
           {:concept-id 733522005, :missing ("Neuhauser syndrome")}
           {:concept-id 787484007, :missing ("Kienbock's disease")}]}

We should expect performance metrics to be the same as option 1, unless there is a need to fall back.

(crit/bench (search svc {:s "Sjogren" :max-hits 500}))
Evaluation count : 1364340 in 60 samples of 22739 calls.
             Execution time mean : 44.225304 µs
    Execution time std-deviation : 289.493260 ns
   Execution time lower quantile : 43.922910 µs ( 2.5%)
   Execution time upper quantile : 44.998974 µs (97.5%)
                   Overhead used : 1.911075 ns

When we have to fall back, performance takes a hit. This affects a tiny proportion of searches, of course.

(crit/bench (search svc {:s "Kluver-Bucy syndrome" :max-hits 500}))
Evaluation count : 511320 in 60 samples of 8522 calls.
             Execution time mean : 116.728221 µs
    Execution time std-deviation : 913.031619 ns
   Execution time lower quantile : 115.979940 µs ( 2.5%)
   Execution time upper quantile : 118.438182 µs (97.5%)
                   Overhead used : 1.911075 ns
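For a side-by-side view, the mean execution times reported by the benchmark runs in this issue can be turned into overheads relative to the term-only baseline. This is only arithmetic on the figures already quoted. The map keys are labels invented for this sketch, and the fallback figure uses a different query string ("Kluver-Bucy syndrome") from the other runs, so that comparison is indicative only.

```clojure
(ns example.overheads)

(def mean-us
  "Mean execution times in microseconds, copied from the benchmark output."
  {:term-only        44.834638    ; option 1
   :term-and-folded  62.420293    ; option 2
   :folded-only      50.357447    ; option 3
   :fallback-unused  44.225304    ; option 4, first pass finds results
   :fallback-used   116.728221})  ; option 4, second pass needed

(defn overhead-pct
  "Percentage difference in mean time relative to the term-only baseline."
  [k]
  (* 100.0 (- (/ (mean-us k) (mean-us :term-only)) 1.0)))

(overhead-pct :term-and-folded)   ; about 39
(overhead-pct :folded-only)       ; about 12
(overhead-pct :fallback-used)     ; about 160
```

The fallback run that never needs the second pass is within about 1.5% of the baseline, which is consistent with it doing the same work as option 1.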

The 3 missing results arise because a search that does not use the folded index returns results that do not match the original concepts, so the fallback is never triggered. Each of these three exceptions involves descriptions that are now inactive.

For example, a search for Kienbock's disease returns results with non-diacritic characters. In fact, concept 84062004 is now called "Juvenile osteochondrosis of carpal lunate" and the synonym "Kienbock's" has been made inactive. Concept 787484007 is "Progressive avascular necrosis of lunate" and is distinct from 84062004.

(search svc {:s "Kienbock's disease"})
=>
(#com.eldrix.hermes.snomed.Result{:id 139392017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 505902017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease - osteochondritis of carpal lunate",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"})

(search svc {:s "Kienbock's disease" :fold true})
=>
(#com.eldrix.hermes.snomed.Result{:id 3775830017,
                                  :conceptId 787484007,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Progressive avascular necrosis of lunate"}
 #com.eldrix.hermes.snomed.Result{:id 139391012,
                                  :conceptId 84062004,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 139392017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"}
 #com.eldrix.hermes.snomed.Result{:id 505902017,
                                  :conceptId 84062004,
                                  :term "Kienbock's disease - osteochondritis of carpal lunate",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"})

While this only affects 3 results, the search results for these terms are surprising. The only way to get a result is to use a diacritic or to use the folded index:

(search svc {:s "Kienböck's disease"})
=>
(#com.eldrix.hermes.snomed.Result{:id 3775830017,
                                  :conceptId 787484007,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Progressive avascular necrosis of lunate"}
 #com.eldrix.hermes.snomed.Result{:id 139391012,
                                  :conceptId 84062004,
                                  :term "Kienböck's disease",
                                  :preferredTerm "Juvenile osteochondrosis of carpal lunate"})

But now we get both concepts, and the user can choose based on the modern preferred term. That suggests a fallback approach can lead to unexpected results.
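The masking effect can be reproduced with a toy in-memory model of the four strategies. This is a sketch under heavy assumptions: the "index" is a small vector of descriptions copied from the results in this issue, matching is exact string equality rather than Lucene token matching, and `fold` is an illustrative stand-in for Hermes' `lang/fold`. All function names are hypothetical.

```clojure
(ns example.strategies
  (:require [clojure.string :as str])
  (:import (java.text Normalizer Normalizer$Form)))

(defn fold
  "Illustrative folding: decompose to NFD and strip combining marks."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")))

(def descriptions
  "A few descriptions taken from the search results shown in this issue."
  [{:concept-id 787484007 :term "Kienböck's disease"}
   {:concept-id 84062004  :term "Kienböck's disease"}
   {:concept-id 84062004  :term "Kienbock's disease"}
   {:concept-id 10651001  :term "Klüver-Bucy syndrome"}])

(defn search-term [s]                      ; option 1: unfolded term only
  (filter #(= s (:term %)) descriptions))

(defn search-folded [s]                    ; option 3: folded index only
  (let [q (fold s)]
    (filter #(= q (fold (:term %))) descriptions)))

(defn search-both [s]                      ; option 2: either field matches
  (distinct (concat (search-term s) (search-folded s))))

(defn search-fallback [s]                  ; option 4: folded only if empty
  (or (seq (search-term s)) (search-folded s)))

(defn concept-ids [results] (set (map :concept-id results)))

;; The fallback works when the first pass finds nothing at all:
(concept-ids (search-fallback "Kluver-Bucy syndrome"))  ;; => #{10651001}

;; It fails when the first pass finds something, but not everything:
(concept-ids (search-term     "Kienbock's disease"))    ;; => #{84062004}
(concept-ids (search-fallback "Kienbock's disease"))    ;; => #{84062004}
(concept-ids (search-folded   "Kienbock's disease"))    ;; => #{84062004 787484007}
(concept-ids (search-both     "Kienbock's disease"))    ;; => #{84062004 787484007}
```

In this model the first pass returns one concept, so the second pass never runs and concept 787484007 is never offered to the user.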

Discussion

The current default is difficult to defend. Client applications need to explicitly request 'fold' to increase the sensitivity of text searches containing diacritics. However, Hermes is live in a number of clinical environments and is returning excellent results with high performance. It would be inappropriate to increase the sensitivity at a loss of specificity - particularly if that impacts users using Hermes with other languages in which diacritics can affect the semantics and therefore should be excluded from folding. We can mitigate that by using excluded characters and similar heuristics based on language preferences. Simply using both the normal and folded indices has a performance impact that is difficult to justify. Using a fallback provides good results and does not change current behaviour, but the analysis has identified a very small number of problem concepts.

wardle · 2023-09-08T20:29:03Z

As such, it seems best to perform search using only the folded index.

This handles other languages, because different characters can be excluded from normalisation on a case-by-case basis, and we should bias towards high sensitivity rather than high specificity - after all, clinical users presented with a pick list will be able to choose, whereas if the option isn't even shown, no choice is possible. Using two indices is redundant when the normalised index will have greater sensitivity, and wastes disk space on two inverted indices. The fallback approach leads to incorrect results: sometimes better results would have been available on the second pass, but because some poor results were obtained on the first pass, the second pass against the normalised index never gets the chance to contribute results. It also needs both inverted indices.

This does have a very small performance impact, but this is better than missing results or returning less good results for a given search. It is likely that some optimisation might be possible to reduce this impact to a minimum.

The use of only a normalised index means that the unadulterated original term text does not need to be indexed, saving space (albeit only 46 MB for a typical distribution).

wardle closed this as completed in 368ff49 Sep 9, 2023
