Search in Umbraco don´t work as expected. #11176

Closed
lucasmichaelsengorm opened this issue Sep 23, 2021 · 11 comments
Labels
status/stale Marked as stale due to inactivity type/bug

Comments

@lucasmichaelsengorm
Contributor

Which exact Umbraco version are you using? For example: 8.13.1 - don't just write v8

8.14.0

Bug summary

When you build your own search on the ExternalIndex and search for terms like "Løns" or "Füre", you get no hits.

Specifics

I had a short discussion with Shazwazza on the Examine project because I thought the issue was there, but I managed to narrow it down to Umbraco - see the thread here

Steps to reproduce

  1. Add some nodes whose names contain, for example, Danish characters, like "Lønsikring", "Lønstigning", "Ansøgning"

  2. Create a search

    public void Search(string query = "løn") {
      IExamineManager examineManager = ExamineManager.Instance;

      if (!examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex index))
          throw new InvalidOperationException($"No index found by name {Constants.UmbracoIndexes.ExternalIndexName}");

      // Wildcard search on nodeName, i.e. +nodeName:løn*
      var results = index.GetSearcher().CreateQuery().Field("nodeName", query.MultipleCharacterWildcard()).Execute();
      var hits = results.TotalItemCount;

      // Fails for "løn" even though nodes named "Løn..." exist
      if (hits == 0)
          throw new Exception("no hits");
    }
    
  3. Run the search with the term "Løn"

Expected result / actual result

So I create 3 nodes with a nodeName like "Lønsikring ...." and search with the term "Løn" plus a wildcard.
The expected result is that TotalItemCount should be 3, but the actual result is 0.

What I have discovered is that the ExternalIndex has the StandardAnalyzer, as I expected, but if you look at the FieldAnalyzer on the field "nodeName", the field is stored with the CultureInvariantStandardAnalyzer, as are nearly all the fields. If you change the CultureInvariantStandardAnalyzer to the StandardAnalyzer on the field, you get the expected result.

When you search inside the Umbraco Dashboard, you get the expected 3 nodes, but the problem is that you can also get unwanted results, because the string "løn" gets parsed. The search query ends up as +nodeName:lon* when it should be +nodeName:Løn*. Why could that give unwanted results? If you also have a node called something like "long bording day", you get 4 results when searching for "Løn".

But have a look at the conversation I had with Shazwazza - see the thread here

@bjarnef
Contributor

bjarnef commented Sep 23, 2021

I previously had a strange issue in Examine where e.g. the Danish character ø was replaced with o in the raw Lucene query:
Shazwazza/Examine#181

I could reproduce the issue with Vendr demo store.

I haven't checked if this has been fixed in a newer version of Umbraco or Examine.

@lucasmichaelsengorm
Contributor Author

@bjarnef Examine doesn't have the problem, as Examine is doing what it's told. The problem is how Umbraco configures the index and the field analyzers. I would like to hear from @UmbracoHQ either why they have configured the field with the CultureInvariantStandardAnalyzer and not the StandardAnalyzer, or whether it is simply a bug that it is configured that way.

I can reproduce the problem in Umbraco from version 8.3 up to the newest.

@bjarnef
Contributor

bjarnef commented Sep 23, 2021

@lucasmichaelsengorm well regarding the issue I was linking to, it seemed to be how Examine constructed the raw lucene query under the hood (when I restarted app pool the lucene query changed), but it may be a different issue you are seeing.

@lucasmichaelsengorm
Contributor Author

@bjarnef you raise an interesting question, which is related to Examine, but I'd say these 2 cases are 2 different things.

Here the problem is that if I search the ExternalIndex with something like +nodeName:løn*, I expect results where Løn is part of the nodeName, but it doesn't hit anything.

If you use indexer.GetSearcher().CreateQuery().Field("nodeName", "løn".MultipleCharacterWildcard()).Execute() where the index analyzer is set to CultureInvariantStandardAnalyzer, for instance on the InternalIndex, it finds nothing. If you remove the MultipleCharacterWildcard call and search the InternalIndex again, Lucene parses the term "løn" to "lon", but still finds nothing. If you run a native query with +nodeName:lon*, it finds every matching node, both those with words starting with løn and those with lon.

Umbraco stores the field nodeName with the CultureInvariantStandardAnalyzer, so Lucene cannot find the term "Løn" without parsing it to "lon", but Lucene is not built to parse the string when MultipleCharacterWildcard is used.

If you change the field to be stored with the StandardAnalyzer and search with indexer.GetSearcher().CreateQuery().Field("nodeName", "løn".MultipleCharacterWildcard()).Execute(), you get all the nodes where løn is part of the name.

So the problem here is how Umbraco tells the fields to be stored.

@Shazwazza
Contributor

Hi all,

Here is the explanation of the issue and reasons behind it. It was unclear until now that Lucene does not translate a term that is a wildcard query using the analyzer. The issue is purely for wildcard searches. The reasons why Lucene does not translate a term for a wildcard query can be found here: Shazwazza/Examine#244 (comment)

@Shazwazza
Contributor

Shazwazza commented Sep 23, 2021

(oops, sent the last message too soon) ... continued:

The reason why this analyzer exists and is used is to simplify searching and indexing across languages for all users. So if you have the word "løn" it will be analyzed and indexed with 'ascii folding' and it becomes "lon". So now if you search on either "løn" or "lon" you will get the result. This can be a friendlier approach for editors if your site has a lot of languages and your editors don't know the nuances of any given language and its accents.

The wildcard issue was unapparent until now. I'm unsure the best way to resolve that particular problem currently.

That said, Umbraco Examine is customizable. You are more than welcome to change the default analyzer and field types for any of the indexes and perhaps that makes sense for your installation.
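
As an illustration of the behaviour described above, here is a minimal sketch in plain C# with no Lucene or Examine involved. `FoldingDemo`, `Fold`, `BuildIndex` and `CountPrefixHits` are hypothetical names, the folding table only covers the characters mentioned in this thread, and each node name is treated as a single term for simplicity. It is meant to show the mechanism only: values are folded at index time, while a wildcard prefix is compared as typed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class FoldingDemo
{
    // Hypothetical mini folding table; a real ASCII folding filter covers far more characters.
    private static readonly Dictionary<char, string> Folds = new Dictionary<char, string>
    {
        { 'æ', "ae" }, { 'ø', "o" }, { 'å', "a" }, { 'ä', "a" }, { 'ü', "u" }
    };

    // Lowercase the text and replace accented characters with ASCII equivalents.
    public static string Fold(string text)
    {
        var folded = new StringBuilder();
        foreach (char c in text.ToLowerInvariant())
        {
            folded.Append(Folds.TryGetValue(c, out string ascii) ? ascii : c.ToString());
        }
        return folded.ToString();
    }

    // Index time: every value passes through the analyzer (here: Fold).
    public static List<string> BuildIndex(IEnumerable<string> names)
    {
        return names.Select(Fold).ToList();
    }

    // Wildcard search: the prefix is compared as given, with no analysis applied.
    public static int CountPrefixHits(IEnumerable<string> index, string prefix)
    {
        return index.Count(term => term.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

With an index built from "Lønsikring", "Lønstigning" and "long boring day", `CountPrefixHits(index, "løn")` finds nothing because the stored terms no longer contain ø, while `CountPrefixHits(index, "lon")` matches all three, including the unrelated "long boring day".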

@bjarnef
Contributor

bjarnef commented Sep 23, 2021

@Shazwazza is there a recommended approach to search using Examine fluent API and wildcard to match many Danish words including æ, ø and å?

I have this on a project, but by default it doesn't seem to find results when the search term has æ, ø and å... but it does when I replace the characters, e.g. æ => ae, ø => o and å => a

public class SearchSurfaceController : SurfaceController
{
        private readonly IExamineManager _examineManager;
        private readonly IUmbracoContextAccessor _umbracoContextAccessor;

        public SearchSurfaceController(IExamineManager examineManager, IUmbracoContextAccessor umbracoContextAccessor)
        {
            _examineManager = examineManager;
            _umbracoContextAccessor = umbracoContextAccessor;
        }

        [ChildActionOnly]
        public ActionResult Search(string q = "", int p = 1, int ps = 12)
        {
            var result = new PagedResult<IPublishedContent>(0, 1, ps);

            if (!q.IsNullOrWhiteSpace() && _examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var index))
            {
                var searcher = index.GetSearcher();
                var query = searcher.CreateQuery(IndexTypes.Content)
                    .RangeQuery<DateTime>(new[] { "createDate" }, DateTime.MinValue, DateTime.MaxValue)
                    .And()
                    .GroupedNot(new[] { "templateID" }, new [] { "0" });

                var searchFields = new[] { "nodeName", "sections", "contents" };

                if (!string.IsNullOrEmpty(q))
                {
                    var searchTerms = Tokenize(q);
                    query.And().GroupedOr(searchFields, searchTerms.Select(x => x.MultipleCharacterWildcard()).ToArray());
                }

                var results = query.Execute(ps * p);

                var totalResults = results.TotalItemCount;
                var pagedResults = results.Skip(ps * (p - 1));

                var items = pagedResults.ToPublishedSearchResults(_umbracoContextAccessor.UmbracoContext.Content)
                                        .Select(x => x.Content);

                result = new PagedResult<IPublishedContent>(totalResults, p, ps)
                {
                    Items = items
                };
            }

            return PartialView("SearchResults", result);
        }

        public IEnumerable<string> Tokenize(string input)
        {
            return Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
                .Cast<Match>()
                .Select(m => m.Value.Trim('\"').ToLower())
                .ToList();
        }
}
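
The character replacement mentioned above could be sketched roughly like this. `SearchTermHelper`, `ReplaceDanishLetters` and `PrepareTerms` are hypothetical names and not part of Examine or Umbraco, and the mapping only covers the three Danish letters from this comment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchTermHelper
{
    // Hypothetical helper: lowercases the term and applies the replacements
    // mentioned above (æ => ae, ø => o, å => a).
    public static string ReplaceDanishLetters(string term)
    {
        return term.ToLowerInvariant()
            .Replace("æ", "ae")
            .Replace("ø", "o")
            .Replace("å", "a");
    }

    // Applies the replacement to every tokenized search term.
    public static string[] PrepareTerms(IEnumerable<string> terms)
    {
        return terms.Select(ReplaceDanishLetters).ToArray();
    }
}
```

In the controller above, the terms returned by `Tokenize` could be passed through `PrepareTerms` before `MultipleCharacterWildcard()` is applied. The trade-off raised earlier in the thread still applies: a term like "løn" then also matches names starting with "lon".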

@lucasmichaelsengorm
Contributor Author

@Shazwazza But as far as I know, Lucene runs with the iso-8859-8 charset, which contains æ, ø, å out of the box.
It would make more sense to find the key to the evil. Right now we know there is a problem with Danish, but if you test with Swedish and German, you have the same problem when you search for words which contain characters like ä or ü.

The standard solution Umbraco gives out of the box looks like it only supports English 100%, which is kind of sad :(

@Shazwazza
Contributor

Hi all. I will recap the issue - we know the 'key to the evil':

  • Umbraco ships with the CultureInvariantStandardAnalyzer for the internal index because it was intended to simplify searching for back office data regardless of your language, so if you have mixed editors from different languages, they don't need to know specific char accents. The CultureInvariantStandardAnalyzer does "ASCII folding", so it converts all Unicode accented chars to their equivalent ASCII chars.
  • This does work, but it doesn't work with wildcards - which was only discovered NOW

Moving forward: Figure out the best approach now that we know this doesn't work for accented languages + wildcard queries. There are probably a few options but ultimately there will be no 'perfect' solution for everyone's website. In those cases, folks should configure an appropriate analyzer for what works for them.

Possible solutions:

  • Change Examine to allow for Umbraco to specify an option to run the Lucene query through the analyzer before appending wildcard chars. This is mentioned, with possible approaches, in "Search with danish char." Shazwazza/Examine#244 (comment)
  • Just use the StandardAnalyzer as default for the internal/members indexes, or possibly an even simpler analyzer. This will not convert chars.
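
The first option could look roughly like the sketch below. It is only an illustration of "analyze first, then append the wildcard" and not Examine's API: `WildcardQueryBuilder` and `BuildPrefixQuery` are hypothetical names, the `analyze` delegate stands in for whatever the field's analyzer does to a single term, and the sketch assumes a one-word term with no Lucene special characters to escape.

```csharp
using System;

public static class WildcardQueryBuilder
{
    // Hypothetical: 'analyze' stands in for the field analyzer's effect on one term.
    public static string BuildPrefixQuery(string field, string term, Func<string, string> analyze)
    {
        if (analyze == null) throw new ArgumentNullException(nameof(analyze));

        // Run the raw term through the analyzer first ...
        string analyzed = analyze(term);

        // ... and only then append the wildcard, so the prefix matches what is stored.
        return "+" + field + ":" + analyzed + "*";
    }
}
```

For a folding analyzer, `BuildPrefixQuery("nodeName", "Løn", t => t.ToLowerInvariant().Replace("ø", "o"))` yields `+nodeName:lon*`. For an analyzer that only lowercases, `BuildPrefixQuery("nodeName", "Løn", t => t.ToLowerInvariant())` yields `+nodeName:løn*`, so the query term stays consistent with how the field was indexed in both cases.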

@lucasmichaelsengorm
Contributor Author

Hello @Shazwazza
thanks for the recap. I just have a question, or rather I just need a little bit more detail.

For my code I use the ExternalIndex, which uses the StandardAnalyzer, but I still need to override the field analyzer to be the StandardAnalyzer as well. Is there a simple way to override it so the fields are stored with the StandardAnalyzer?

@umbrabot

Hiya @lucasmichaelsengorm,

Just wanted to let you know that we noticed that this issue got a bit stale and might not be relevant any more.

We will close this issue for now but we're happy to open it up again if you think it's still relevant (for example: it's a feature request that's not yet implemented, or it's a bug that's not yet been fixed).

To open this issue up again, you can write @umbrabot still relevant in a new comment as the first line. It would be super helpful for us if on the next line you could let us know why you think it's still relevant.

For example:

@umbrabot still relevant
This bug can still be reproduced in version x.y.z

This will reopen the issue in the next few hours.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

@umbrabot umbrabot added the status/stale Marked as stale due to inactivity label Jul 25, 2022