Allow exact match search via double quotes #1772

Closed
3 tasks done
wilhelmer opened this issue Jun 22, 2020 · 18 comments
Labels
change request: Issue requests a new feature or improvement

Comments

@wilhelmer
Contributor

wilhelmer commented Jun 22, 2020

I checked that...

  • ... the documentation does not mention anything about my idea
  • ... to my best knowledge, my idea wouldn't break something for other users
  • ... there are no open or closed issues that are related to my idea

Description

Material supports lunr's search syntax. With that syntax, to perform an exact match search (logical AND search), all terms must be prefixed with a plus (+):

+my +search +term

This is technically fine, but not intuitive. I assume most users try to perform an exact match search by using double quotes, because that's the way it's usually done in Google:

"my search term"

It would be nice if Material supported this as an additional method to perform an exact match search.

Adding some RegEx magic to defaultTransform() should do the trick.
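
A minimal sketch of that idea, assuming a hypothetical helper named `quotedToRequired` (it is not part of Material or lunr) that would run before the existing transform and rewrite each double-quoted phrase into required terms:

```typescript
// Hypothetical helper: not part of Material for MkDocs or lunr.
// Rewrites each double-quoted phrase so that every term is required (+term).
function quotedToRequired(query: string): string {
  return query.replace(/"([^"]+)"/g, (_match: string, phrase: string) =>
    phrase
      .trim()
      .split(/\s+/)
      .map(term => `+${term}`)
      .join(" ")
  )
}

console.log(quotedToRequired('"my search term"')) // +my +search +term
```

Text outside of quotes is left untouched, so the default OR behavior still applies to it.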

Use Cases

Users that don't know how to do a logical AND search would have a much greater chance to find out by trial and error if double quotes were supported.

@squidfunk
Owner

squidfunk commented Jun 22, 2020

This is technically fine, but not intuitive. I assume most users try to perform an exact match search by using double quotes, because that's the way it's usually done in Google

Do you have data to back that up? I've never seen someone using quote syntax in Google Analytics for the official docs.

It would be nice if Material supported this as an additional method to perform an exact match search.

Adding some RegEx magic to defaultTransform() should do the trick.

The regex is already pretty complicated:

export function defaultTransform(value: string): string {
  return value
    .replace(/(?:^|\s+)[*+-:^~]+(?=\s+|$)/g, "")
    .trim()
    .replace(/\s+|(?![^\x00-\x7F]|^)$|\b$/g, "* ")
}

I'm not sure whether +my +search +term is semantically the same as "my search term", as putting a query in quotes searches for a phrase (i.e. order is important), while the former syntax searches for occurrences, at least from what I understand of Google and other search engines. Also, note that +my +search +term is currently translated to +my* +search* +term*.
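
To make the wildcard translation concrete, here is a small sketch that isolates the last `replace()` of the transform above. The helper name `appendWildcards` is made up for illustration; only the regular expression is taken from the snippet.

```typescript
// The final replace() from defaultTransform, isolated for illustration.
// The helper name is hypothetical; the regex is copied from the snippet.
function appendWildcards(value: string): string {
  return value
    .trim()
    .replace(/\s+|(?![^\x00-\x7F]|^)$|\b$/g, "* ")
}

// The replacement leaves a trailing space, which is trimmed here for display.
console.log(appendWildcards("+my +search +term").trim()) // +my* +search* +term*
```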

@wilhelmer
Contributor Author

This is technically fine, but not intuitive. I assume most users try to perform an exact match search by using double quotes, because that's the way it's usually done in Google

Do you have data to back that up? I've never seen someone using quote syntax in Google Analytics for the official docs.

In the past two years, I've had 701 site searches with quotes, 0 searches with the + operator.

Note that the + operator is deprecated in Google Search, so people familiar with Google's syntax will probably avoid it.

I'm not sure whether +my +search +term is semantically the same as "my search term", as putting a query in quotes searches for a phrase (i.e. order is important), while the former syntax searches for occurrences, at least from what I understand of Google and other search engines. Also, note that +my +search +term is currently translated to +my* +search* +term*.

Well yes, that is a drawback. We can't exactly replicate Google's search behavior since lunr doesn't support specifying a word order. So +my +search +term will always include results that have "search" in the heading and "term" somewhere way below in a paragraph. Is that such a major drawback that we shouldn't allow search with quotes at all? Not sure.

BTW, I played around with our search and was able to improve the results by boosting the first word: +my^100 +search +term. But that may be anecdotal.
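
A rough sketch of that boosting experiment, assuming a hypothetical helper `boostFirstTerm` that appends lunr's `^` boost to the first term of an already transformed query:

```typescript
// Hypothetical helper: appends a lunr boost (^n) to the first term of a query.
function boostFirstTerm(query: string, boost: number): string {
  const terms = query.trim().split(/\s+/)
  if (terms[0] === "") return query // nothing to boost in an empty query
  terms[0] = `${terms[0]}^${boost}`
  return terms.join(" ")
}

console.log(boostFirstTerm("+my +search +term", 100)) // +my^100 +search +term
```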

@squidfunk
Owner

squidfunk commented Jun 22, 2020

I guess we could implement quotes when we have a spec that is complete. If you can draft something up as part of this issue, we might consider implementing it. Some questions to be answered by this specification:

  • What happens with unmatched quotes? "my search
  • What happens when we have modifiers in quotes? "+my -search" +term
  • What about single quotes? 'my search

As a rule of thumb, we should probably stay as close as possible to Google. Please specify the respective transforms for all cases. Also, maybe there're some more edge cases I haven't thought of.

@squidfunk added the discussion and needs input labels Jun 22, 2020
@squidfunk
Owner

On a side note, with the release of Material 5 and the support of the whole Lunr Syntax, we should think of how we can integrate a cheat sheet of how to use the search / docs, etc. For example, search by title, fuzzy search etc. It's quite powerful:

[Screenshot 2020-06-22 at 16 20 44]

@michael-nok
Contributor

I am not a search expert, but I would like to see this feature implemented as well, so here are my inputs on the subject.

  • What happens with unmatched quotes? "my search

From what I'm seeing in the Google specification, unmatched quotes would be considered as an incomplete query. You may have to disable look ahead search when the query starts with a double quote.

  • What happens when we have modifiers in quotes? "+my -search" +term

Modifiers in quotes are taken as literal characters and not interpreted as modifiers.

  • What about single quotes? 'my search

Single quotes are their own character. They may appear in SQL commands, and therefore do not encapsulate the search query.

@wilhelmer
Contributor Author

  • What happens with unmatched quotes? "my search

From what I'm seeing in the Google specification, unmatched quotes would be considered as an incomplete query. You may have to disable look ahead search when the query starts with a double quote.

I think unmatched quotes are simply stripped from the query. So "my search should be equal to my search.

More observations via Google:

Search for " -> unmatched quote is not stripped
Search for "" -> yields no results at all.

Some additional information here: http://www.googleguide.com/quoted_phrases.html

Tidbits:

A quoted phrase is the most widely used type of special search syntax.

So let's support it :-)

Google will search for common words (stop words) included in quotes
Google doesn’t perform automatic stemming on phrases

So disable lunr's pipeline for quoted search terms? Might be difficult to implement?

@squidfunk
Owner

squidfunk commented Jun 23, 2020

Okay, so I guess we have to check how to escape modifiers, so they're treated as literals in queries. The stemmer is disabled for most languages, as most people don't like it and wonder why their queries don't return any results. It also doesn't work well with the type-ahead experience. Thus, there's only a trimmer and stopWordFilter in place. Both are only part of the indexing pipeline, not of the search pipeline, which includes only the stemmer by default.

For this reason, I see no possibility to disable the stopWordFilter for searching or even parts of queries. Thus, from what I understand, the only thing we could try to implement would be "my search terms" => +my +search +terms.

On a second note, there's another problem with some of those modifiers, as - is, by default, part of the tokenizer separator. Thus, all - are lost before adding a document to the index. This might be problematic in regard to the automatic transformation, because "my -search +term" would theoretically become +my +\-search +\+term, which may lead to no results. Before implementing this feature, we really need to understand how lunr handles special characters. Or we ignore modifiers and just do the simple "my search terms" => +my +search +terms transformation, disregarding any special character cases.
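
To illustrate the separator problem, here is a simplified stand-in for the tokenizer. It assumes lunr's default separator `/[\s\-]+/` and models nothing else from the indexing pipeline; `tokenize` is a hypothetical name.

```typescript
// Simplified stand-in for lunr's tokenizer: it only models the default
// separator (whitespace and hyphens), not the rest of the indexing pipeline.
const separator = /[\s\-]+/

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(separator)
    .filter(token => token.length > 0)
}

console.log(tokenize("my -search +term")) // [ 'my', 'search', '+term' ]
```

Since the hyphen never reaches the index, a query for the escaped literal \-search would have nothing to match against.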

@squidfunk
Owner

squidfunk commented Jun 26, 2020

Okay, so I suggest the following proposal:

  • Prepend all terms in double quotes with +: "my search" term becomes +my +search term
  • Escape control characters [1] +, -, ...: "+my -search term" becomes +\+my +\-search +term
  • Ignore unmatched double quotes: "my +search term becomes my +search term

It's not trivial to implement, though. I guess we have to depart from the simple regular-expression-based transform and implement a proper parser. Should be possible in a few dozen lines.


[1] See the end of the paragraph on QueryString
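
The three rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `transformQuotedQuery` is a hypothetical name, and the set of control characters is assumed to be the operators that the existing transform already strips (`* + - : ^ ~`).

```typescript
// Assumed set of control characters, taken from the operators that the
// existing transform strips; the actual set in lunr may differ.
const CONTROL_CHARACTERS = /[*+\-:^~]/g

// Hypothetical sketch of the proposal, not the implementation in Material.
function transformQuotedQuery(query: string): string {
  return query
    .split(/"([^"]+)"/) // odd indices hold the contents of matched quotes
    .map((part, index) => index % 2 === 1
      ? part
          .trim()
          .split(/\s+/)
          // Rule 2: escape control characters; rule 1: prepend "+"
          .map(term => `+${term.replace(CONTROL_CHARACTERS, "\\$&")}`)
          .join(" ")
      : part.replace(/"/g, "") // Rule 3: ignore unmatched double quotes
    )
    .join("")
}

console.log(transformQuotedQuery('"my search" term'))   // +my +search term
console.log(transformQuotedQuery('"+my -search term"')) // +\+my +\-search +term
console.log(transformQuotedQuery('"my +search term'))   // my +search term
```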

@wilhelmer
Contributor Author

Thanks! Bullets 1) and 3) are fine with me. 2) looks a little overengineered. I assume the number of users who want to both use exact search and search for control characters is too low to justify the development effort. So we could leave that out, and "+my -search term" becomes ++my +-search +term, which simply yields no results.

@squidfunk
Owner

Yes, 2. is probably not much of a use case.

@squidfunk added the change request label and removed the needs input and proposal labels Jun 26, 2020
@squidfunk
Owner

Implemented in #1778. Feedback appreciated so we can push this out quickly.

export function defaultTransform(value: string): string {
  return value
    .split(/"([^"]+)"/)                             /* => 1 */
      .map((terms, i) => i & 1
        ? terms.replace(/^\b|^(?![^\x00-\x7F]|$)|\s+/g, " +")
        : terms
      )
      .join("")
    .replace(/"|(?:^|\s+)[*+\-:^~]+(?=\s+|$)/g, "") /* => 2 */
    .trim()                                         /* => 3 */
    .replace(/\s+|(?![^\x00-\x7F]|^)$|\b$/g, "* ")  /* => 4 */
}

Also, see the comments in the doc block of the defaultTransform function.

@wilhelmer
Contributor Author

wilhelmer commented Jun 29, 2020

Great, thanks! There's no global modifier in .split(/"([^"]+)"/), meaning that only the first quoted phrase will be modified.

E.g. "my search term" is "a great term" -> +my +search +term is "a great term" -> 0 results.

We haven't discussed that - I think it's also not much of a use case, so we can leave it as it is. Just wanted to mention.
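
A quick way to check how `split` treats several quoted phrases is to run it in isolation. As far as I know, `String.prototype.split` splits on every match whether or not the regular expression carries the `g` flag, so it may be worth verifying which step actually drops the second phrase:

```typescript
// split() with a capturing group returns the captured text of every match,
// whether or not the regex carries the global flag.
const parts = '"my search term" is "a great term"'.split(/"([^"]+)"/)

console.log(parts) // [ '', 'my search term', ' is ', 'a great term', '' ]
```

With this input, both quoted phrases end up at odd indices of the resulting array.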

@squidfunk
Owner

Thanks for pointing it out, we should add it!

@wilhelmer
Contributor Author

Okay, if you change it anyway, you can also fix some small typos in the comment:

* to and `AND` query (as opposed to the default `OR` behavior). While users

and -> an

* may expected terms enclosed in quotation marks to map to span queries,

expected -> expect

* asterisik (wildcard) in between terms, which can be denoted by whitespace,

asterisik -> asterisk

@squidfunk
Owner

All settled in 759f2b9.

@wilhelmer
Contributor Author

Looks good! If there's a space after the first quote, the query will be transformed like this:

" my search term" -> + +my +search +term

At first, I thought this might be a problem, but lunr seems to handle that and remove orphan control characters. So no problem.
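
Independently of how lunr treats them, orphan operators can also be removed before the query reaches lunr. The sketch below isolates the operator-stripping pattern from the `defaultTransform` snippet posted earlier in this thread; the helper name `stripOrphanOperators` is hypothetical.

```typescript
// Hypothetical helper isolating the operator-stripping replace() from the
// defaultTransform snippet: removes operators that are not attached to a term.
function stripOrphanOperators(query: string): string {
  return query
    .replace(/(?:^|\s+)[*+\-:^~]+(?=\s+|$)/g, "")
    .trim()
}

console.log(stripOrphanOperators("+ +my +search +term")) // +my +search +term
```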

@squidfunk
Owner

Released as part of 5.4.0

@MaximilianKohler

MaximilianKohler commented Sep 1, 2023

Why isn't this working for me? I have tried both of these:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

plugins:
  - search

I searched for example phrase and for "example phrase", and neither returns any results.

EDIT:

Oh, it seems to have to do with some word-count limit in the search, but it also seems bugged in other ways. If I search for "Cariogenic microbiome" it will return results for both cariogenic and microbiome. If I search for Cariogenic microbiome and microbiota of the early primary dentition it will give me tons of results but not ranked in the "most matches" or "best matches" order, so it's useless. If I search for "Cariogenic microbiome and microbiota of the early primary dentition" I get no results.
