Search in Umbraco doesn't work as expected. #11176

lucasmichaelsengorm · 2021-09-23T08:04:25Z

Which exact Umbraco version are you using? For example: 8.13.1 - don't just write v8

8.14.0

Bug summary

When you build your own search on the ExternalIndex and search for terms like "Løns" or "Füre", you get no hits.

Specifics

I had a short discussion with Shazwazza on the Examine project because I thought the issue was there, but I managed to narrow it down to Umbraco - see the thread here

Steps to reproduce

Add some nodes whose names contain Danish characters, e.g. "Lønsikring", "Lønstigning", "Ansøgning"

Create a search

public void Search(string query = "løn")
{
    IExamineManager examineManager = ExamineManager.Instance;

    if (!examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex index))
        throw new InvalidOperationException($"No index found by name {Constants.UmbracoIndexes.ExternalIndexName}");

    // Wildcard (prefix) search on nodeName, i.e. +nodeName:løn*
    var results = index.GetSearcher().CreateQuery().Field("nodeName", query.MultipleCharacterWildcard()).Execute();
    var hits = results.TotalItemCount;

    if (hits == 0)
        throw new Exception("no hits");
}

Run the search with the term "Løn"

Expected result / actual result

I created 3 nodes with a nodeName of "Lønsikring ...." and searched for the term "Løn" with a trailing wildcard.
The expected result is a TotalItemCount of 3, but the actual result is 0.

What I have discovered is that the ExternalIndex uses the StandardAnalyzer, as I expected, but if you look at the field analyzer for the "nodeName" field, it is indexed with the CultureInvariantStandardAnalyzer, as are nearly all the other fields. If you change the field's analyzer from CultureInvariantStandardAnalyzer to StandardAnalyzer, you get the expected result.

When you search inside the Umbraco Dashboard, you do get the expected 3 nodes, but you also get unwanted results, because the string "løn" gets parsed: the search query becomes +nodeName:lon* when it should be +nodeName:Løn*. Why does that give unwanted results? If you also have a node called something like "long bording day", a search for "Løn" returns 4 results.

For more detail, look at the conversation I had with Shazwazza - see the thread here


bjarnef · 2021-09-23T13:38:45Z

I previously had a strange issue in Examine where e.g. the Danish character ø was replaced with o in the raw lucene query:
Shazwazza/Examine#181

I could reproduce the issue with Vendr demo store.

I haven't checked if this has been fixed in a newer version of Umbraco or Examine.

lucasmichaelsengorm · 2021-09-23T13:51:48Z

@bjarnef Examine doesn't have the problem, as Examine is doing what it's told. The problem is how Umbraco configures the index and the field analyzers. I would like to hear from @UmbracoHQ why the fields are configured with the CultureInvariantStandardAnalyzer and not the StandardAnalyzer, or whether it is simply a bug that they are configured that way.

I can reproduce the problem in Umbraco from version 8.3 up to the newest.

bjarnef · 2021-09-23T13:55:41Z

@lucasmichaelsengorm well regarding the issue I was linking to, it seemed to be how Examine constructed the raw lucene query under the hood (when I restarted app pool the lucene query changed), but it may be a different issue you are seeing.

lucasmichaelsengorm · 2021-09-23T14:24:49Z

@bjarnef yours is an interesting question, and it is related to Examine, but I'd say these two cases are two different things.

Here the problem is that I search for a word, like +nodeName:løn*, against the ExternalIndex and expect results where Løn is part of the nodeName, but it does not hit any results.

If you use indexer.GetSearcher().CreateQuery().Field("nodeName", "løn".MultipleCharacterWildcard()).Execute() against an index whose analyzer is the CultureInvariantStandardAnalyzer, for instance the InternalIndex, it finds nothing. If you remove the MultipleCharacterWildcard call and search the InternalIndex again, Lucene parses the term "løn" to "lon", but still finds nothing. If you run a native query with +nodeName:lon*, it finds every matching node, both those with words starting with løn and those with words starting with lon.

Umbraco stores the nodeName field with the CultureInvariantStandardAnalyzer, so Lucene cannot find the term "Løn" unless it is parsed to "lon", but Lucene is not built to parse the string when the MultipleCharacterWildcard function is used.

If you change the field to be stored with the StandardAnalyzer and search with indexer.GetSearcher().CreateQuery().Field("nodeName", "løn".MultipleCharacterWildcard()).Execute(), you get all the nodes where løn is part of the name.

So the problem here is how Umbraco tells the fields to be stored.

Shazwazza · 2021-09-23T15:04:17Z

Hi all,

Here is the explanation of the issue and reasons behind it. It was unclear until now that Lucene does not translate a term that is a wildcard query using the analyzer. The issue is purely for wildcard searches. The reasons why Lucene does not translate a term for a wildcard query can be found here: Shazwazza/Examine#244 (comment)
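The effect described in this comment can be simulated without Lucene. The sketch below is not Examine or Lucene code: `WildcardFoldingDemo`, `Analyze`, `BuildIndex` and `CountPrefixHits` are hypothetical names, and the "analyzer" only lower-cases and maps ø to o. It assumes the index stores analyzed tokens while a wildcard (prefix) term is compared to them unanalyzed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an index whose analyzer folds accents at index time.
public static class WildcardFoldingDemo
{
    // Simplified analysis (assumption for this demo): lower-case and map 'ø' to 'o'.
    public static string Analyze(string text) =>
        text.ToLowerInvariant().Replace('ø', 'o');

    // Index time: every whitespace-separated token goes through the analyzer.
    public static List<string[]> BuildIndex(IEnumerable<string> nodeNames) =>
        nodeNames
            .Select(n => n.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                          .Select(Analyze)
                          .ToArray())
            .ToList();

    // Query time: the prefix of a "term*" query is compared to the stored tokens as-is.
    public static int CountPrefixHits(List<string[]> index, string prefix) =>
        index.Count(tokens => tokens.Any(t => t.StartsWith(prefix, StringComparison.Ordinal)));
}
```

With the node names "Lønsikring", "Lønstigning", "Ansøgning" and "long bording day", `CountPrefixHits(index, "løn")` returns 0, because no stored token starts with "løn". `CountPrefixHits(index, Analyze("Løn"))` returns 3: the two "løn" nodes plus the unwanted "long bording day".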

Shazwazza · 2021-09-23T15:08:06Z

(oops, sent the last message too soon) ... continued:

The reason why this analyzer exists and is used is to simplify searching and indexing across languages for all users. So if you have the word "løn" it will be analyzed and indexed with 'ascii folding' and it becomes "lon". So now if you search on either "løn" or "lon" you will get the result. This can be a friendlier approach for editors if your site has a lot of languages and your editors don't know the nuances of any given language and its accents.
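As a rough illustration of what 'ascii folding' means, here is a minimal sketch. It is not the implementation used by the CultureInvariantStandardAnalyzer or by Lucene; `AsciiFoldingSketch` and its small mapping table are assumptions made for the example. Letters such as ü and å decompose into a base letter plus a combining mark, which can be stripped, whereas ø and æ have no Unicode decomposition and need an explicit mapping.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFoldingSketch
{
    // Letters without a Unicode decomposition need an explicit mapping
    // (illustrative subset only, not a complete folding table).
    private static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        { 'ø', "o" }, { 'Ø', "O" }, { 'æ', "ae" }, { 'Æ', "AE" }, { 'ß', "ss" }
    };

    public static string Fold(string input)
    {
        var sb = new StringBuilder();
        // FormD splits e.g. 'ü' into 'u' followed by a combining diaeresis.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (Special.TryGetValue(c, out var replacement))
                sb.Append(replacement);
            else if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c); // keep everything except the combining marks
        }
        return sb.ToString();
    }
}
```

Here `Fold("løn")` and `Fold("lon")` both give "lon", which is why either spelling finds the same indexed term when the query itself is analyzed.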

The wildcard issue was unapparent until now. I'm unsure the best way to resolve that particular problem currently.

That said, Umbraco Examine is customizable. You are more than welcome to change the default analyzer and field types for any of the indexes and perhaps that makes sense for your installation.

bjarnef · 2021-09-23T16:24:46Z

@Shazwazza is there a recommended approach to search using Examine fluent API and wildcard to match many Danish words including æ, ø and å?

I have this on a project, but by default it doesn't seem to find results when the search term has æ, ø and å... but it does when I replace the characters, e.g. æ => ae, ø => o and å => a

public class SearchSurfaceController : SurfaceController
{
        private readonly IExamineManager _examineManager;
        private readonly IUmbracoContextAccessor _umbracoContextAccessor;

        public SearchSurfaceController(IExamineManager examineManager, IUmbracoContextAccessor umbracoContextAccessor)
        {
            _examineManager = examineManager;
            _umbracoContextAccessor = umbracoContextAccessor;
        }

        [ChildActionOnly]
        public ActionResult Search(string q = "", int p = 1, int ps = 12)
        {
            var result = new PagedResult<IPublishedContent>(0, 1, ps);

            if (!q.IsNullOrWhiteSpace() && _examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var index))
            {
                var searcher = index.GetSearcher();
                var query = searcher.CreateQuery(IndexTypes.Content)
                    .RangeQuery<DateTime>(new[] { "createDate" }, DateTime.MinValue, DateTime.MaxValue)
                    .And()
                    .GroupedNot(new[] { "templateID" }, new [] { "0" });

                var searchFields = new[] { "nodeName", "sections", "contents" };

                if (!string.IsNullOrEmpty(q))
                {
                    var searchTerms = Tokenize(q);
                    query.And().GroupedOr(searchFields, searchTerms.Select(x => x.MultipleCharacterWildcard()).ToArray());
                }

                var results = query.Execute(ps * p);

                var totalResults = results.TotalItemCount;
                var pagedResults = results.Skip(ps * (p - 1));

                var items = pagedResults.ToPublishedSearchResults(_umbracoContextAccessor.UmbracoContext.Content)
                                        .Select(x => x.Content);

                result = new PagedResult<IPublishedContent>(totalResults, p, ps)
                {
                    Items = items
                };
            }

            return PartialView("SearchResults", result);
        }

        public IEnumerable<string> Tokenize(string input)
        {
            return Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
                .Cast<Match>()
                .Select(m => m.Value.Trim('\"').ToLower())
                .ToList();
        }
}
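The character replacement mentioned before the controller could be applied to each token before the wildcard is added. This is only a sketch of that workaround: `DanishQueryTerms` and its methods are hypothetical helpers, and the mappings are simply the ones listed above, assumed to match how the field was folded at index time.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DanishQueryTerms
{
    // Hypothetical helper: applies æ => ae, ø => o, å => a to one lower-cased term.
    public static string Fold(string term) =>
        term.Replace("æ", "ae").Replace("ø", "o").Replace("å", "a");

    // Folds every term of an already tokenized query.
    public static string[] FoldAll(IEnumerable<string> terms) =>
        terms.Select(Fold).ToArray();
}
```

In the controller above, the terms would be folded just before the wildcard is applied, for example `searchTerms.Select(x => DanishQueryTerms.Fold(x).MultipleCharacterWildcard())`. That line is untested against Examine, and it also matches unaccented words such as "long", as noted earlier in the thread.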

lucasmichaelsengorm · 2021-09-23T20:17:32Z

@Shazwazza But as far as I know, Lucene runs with the iso-8859-8 charset, which contains æ, ø, å out of the box.
It would make more sense to find the key to the evil. Right now we know there is a problem with Danish, but if you test with Swedish and German, you get the same problem when you search for words that contain characters such as ä or ü.

The standard solution that Umbraco provides out of the box looks like it only fully supports English, which is kind of sad :(

Shazwazza · 2021-09-27T16:52:51Z

Hi all. I will recap the issue - we know the 'key to the evil':

Umbraco ships with the CultureInvariantStandardAnalyzer for the internal index because it was intended to simplify searching for back office data regardless of your language so if you have mixed editors from different languages, they don't need to know specific char accents. The CultureInvariantStandardAnalyzer does "Ascii folding" so converts all unicode accented chars to their equivalent ascii chars.
This does work, but it doesn't work with wildcards - which was only discovered NOW

Moving forward: Figure out the best approach now that we know this doesn't work for accented languages + wildcard queries. There's probably a few options but ultimately there will be no 'perfect' solution for everyone's website. In those cases, folks should configure an appropriate analyzer for what works for them.

Possible solutions:

1. Change Examine to allow for Umbraco to specify an option to run the lucene query through the analyzer before appending wildcard chars. Possible approaches are mentioned here: "Search with danish char." Shazwazza/Examine#244 (comment)
2. Just use the StandardAnalyzer as default for the internal/members indexes, or possibly an even simpler analyzer. This will not convert chars.

lucasmichaelsengorm · 2021-09-27T18:36:05Z

Hello @Shazwazza
Thanks for the recap. I just have a question, or rather I just need a few more details.

My code uses the ExternalIndex, which uses the StandardAnalyzer, but I still need to override the field analyzer to be the StandardAnalyzer as well. Is there a simple way to override this so that the fields are stored with the StandardAnalyzer?

umbrabot · 2022-07-25T08:40:25Z

Hiya @lucasmichaelsengorm,

Just wanted to let you know that we noticed that this issue got a bit stale and might not be relevant any more.

We will close this issue for now but we're happy to open it up again if you think it's still relevant (for example: it's a feature request that's not yet implemented, or it's a bug that's not yet been fixed).

To open this issue up again, you can write @umbrabot still relevant in a new comment as the first line. It would be super helpful for us if on the next line you could let us know why you think it's still relevant.

For example:

@umbrabot still relevant
This bug can still be reproduced in version x.y.z

This will reopen the issue in the next few hours.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

lucasmichaelsengorm added the type/bug label Sep 23, 2021

Shazwazza mentioned this issue Dec 20, 2021

Issue with special characters when upgrading Examine Shazwazza/Examine#263

Closed

jrunestone mentioned this issue Jun 27, 2022

Searches for nodes with Swedish Special Characters (ÅÄÖ) in Backoffice #12626

Open

umbrabot added the status/stale Marked as stale due to inactivity label Jul 25, 2022

umbrabot closed this as completed Jul 25, 2022

bjarnef mentioned this issue May 6, 2024

Wildcard search in GroupedOr() Shazwazza/Examine#383

Closed
