Exact match doesn't seem to work #921

Closed
mrusme opened this issue Dec 19, 2021 · 14 comments
Labels
bug Something isn't working

Comments

@mrusme
mrusme commented Dec 19, 2021

Describe the bug

The exact search doesn't seem to be working.

Steps to reproduce (if applicable)
Steps to reproduce the behavior:

 ▲ quickwit index search --index-id wikipedia --metastore-uri file://$(pwd)/wikipedia --query 'title:apollo AND 11' | jq '.hits[].title[]'
"Apollo"
"Apollo 11"
"Apollo 8"
"Apollo program"
"Apollo 13"
"Apollo 7"
"Apollo 9"
"Apollo 1"
"Apollo 10"
"Apollo 12"
"Apollo 14"
"Apollo 15"
"Apollo 16"
"Apollo 17"
"List of Apollo astronauts"
"Apollo, Pennsylvania"
"Apollo 13 (film)"
"Apollo Lunar Module"
"Apollo Guidance Computer"
"Apollo 4"

Okay, so it seems we've found what we're looking for as the second result.
However, since the article is literally named Apollo 11, we should be able to
perform what (according to quickwit's documentation) seems to be an exact
search:

▲ quickwit index search --index-id wikipedia --metastore-uri file://$(pwd)/wikipedia --query 'title:"Apollo 11"' | jq '.hits[].title[]'

Expected behavior

The "Apollo 11" result should be showing up.

System configuration:

60f897c0f49b4a920948b2bb98ca081f5557ed22 built from source on Linux, rustc 1.56.1

Additional context

@mrusme mrusme added the bug Something isn't working label Dec 19, 2021
@fmassot
Contributor

fmassot commented Dec 19, 2021

Thanks @mrusme for the bug report. Can you share your index schema? I just want to make sure that you can do phrase queries. If the schema does not specify that positions are stored, then quickwit should return an error.

@mrusme
Author

mrusme commented Dec 19, 2021

@fmassot thanks for the quick reply. That's the schema I'm using:

{
  "version": 0,
  "index_id": "wikipedia",
  "index_uri": "file:///home/mrus/projects/@mrusme/ulpia/wikipedia",
  "search_settings": {
    "default_search_fields": ["title", "section_texts"]
  },
  "doc_mapping": {
    "store_source": true,
    "field_mappings": [
      {
        "name": "title",
        "type": "text"
      },
      {
        "name": "section_titles",
        "type": "array<text>"
      },
      {
        "name": "section_texts",
        "type": "array<text>"
      },
      {
        "name": "interlinks",
        "type": "array<text>",
        "indexed": false,
        "stored": false
      }
    ]
  }
}

Since I couldn't find any documentation for the latest master, I had to scramble together the config myself. I might have made a mistake there?

@fmassot
Contributor

fmassot commented Dec 19, 2021

Sorry for the outdated documentation; we are working on it right now and it will be in line with the code in 2 weeks.

With your current schema, quickwit should return an error. Remove the jq part and look at the output.

Here is the schema you need to have to make phrase queries:

{
  "version": 0,
  "index_id": "wikipedia",
  "index_uri": "file:///home/mrus/projects/@mrusme/ulpia/wikipedia",
  "search_settings": {
    "default_search_fields": ["title", "section_texts"]
  },
  "doc_mapping": {
    "store_source": true,
    "field_mappings": [
      {
        "name": "title",
        "type": "text",
        "record": "position" // Recording positions enables phrase queries on this field.
      },
      {
        "name": "section_titles",
        "type": "array<text>"
      },
      {
        "name": "section_texts",
        "type": "array<text>"
      },
      {
        "name": "interlinks",
        "type": "array<text>",
        "indexed": false,
        "stored": false
      }
    ]
  }
}
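To illustrate why a phrase query needs recorded positions, here is a minimal Python sketch of a positional inverted index. It is not Quickwit or tantivy code: the `tokenize`, `build_positional_index`, `and_query`, and `phrase_query` helpers are hypothetical names, and the tokenizer is a simplified stand-in that lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Simplified stand-in tokenizer (assumption): lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    # term -> {doc_id: [positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def and_query(index, query):
    # Needs only document ids: every term must occur somewhere in the document.
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


def phrase_query(index, phrase):
    # Needs positions: the terms must occur consecutively, in order.
    terms = tokenize(phrase)
    hits = []
    for doc_id in and_query(index, phrase):
        starts = index[terms[0]][doc_id]
        if any(
            all(start + offset in index[term][doc_id] for offset, term in enumerate(terms))
            for start in starts
        ):
            hits.append(doc_id)
    return hits


titles = ["Apollo 11", "Apollo 13", "Apollo 11 missing tapes", "11 views of Apollo"]
index = build_positional_index(titles)
print(and_query(index, "apollo 11"))     # ids of titles containing both terms
print(phrase_query(index, "Apollo 11"))  # ids of titles containing the exact phrase
```

Without the position lists, only `and_query` could be answered, which is why a field that does not record positions cannot serve a phrase query.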

@mrusme
Author

mrusme commented Dec 19, 2021

Unfortunately it doesn't return an error (or at least it doesn't look like one):

 ▲ quickwit index search --index-id wikipedia --metastore-uri file://$(pwd)/wikipedia --query 'title:"Barack Obama"'
{
  "numHits": 0,
  "hits": [],
  "elapsedTimeMicros": 4614
}

Also, even with | jq an error should be visible, since jq only reads from stdin, not from stderr in that case.
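The stdin/stderr point can be checked with a small Python sketch that wires two processes together the way a shell pipe does. The producer and consumer here are tiny stand-in Python one-liners, not quickwit or jq, and stderr is captured separately only so it can be inspected; in a real shell it would go straight to the terminal.

```python
import subprocess
import sys

# Stand-in for the search command: writes a result to stdout and an error to stderr.
PRODUCER = [sys.executable, "-c",
            "import sys; sys.stdout.write('hits'); sys.stderr.write('error')"]
# Stand-in for jq: transforms whatever arrives on stdin.
CONSUMER = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_pipeline():
    producer = subprocess.Popen(PRODUCER, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, text=True)
    # Only the producer's stdout is connected to the consumer's stdin.
    consumer = subprocess.Popen(CONSUMER, stdin=producer.stdout,
                                stdout=subprocess.PIPE, text=True)
    producer.stdout.close()  # let the consumer see EOF when the producer exits
    piped_output, _ = consumer.communicate()
    bypassed_stderr = producer.stderr.read()
    producer.stderr.close()
    producer.wait()
    return piped_output, bypassed_stderr


piped_output, bypassed_stderr = run_pipeline()
print("consumer saw:", piped_output)
print("stderr bypassed the pipe:", bypassed_stderr)
```

The consumer only ever transforms the stdout text; the stderr text reaches the caller untouched, so an error message would still be visible with `| jq` in place.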

@fmassot
Contributor

fmassot commented Dec 19, 2021

Ok, it's a bug then, I need to check that.

@fmassot fmassot self-assigned this Dec 19, 2021
@mrusme
Author

mrusme commented Dec 19, 2021

Also one more thing: I applied your change to my config and performed the following command:

quickwit index ingest --index-id wikipedia --metastore-uri file://$(pwd)/wikipedia --data-dir-path ./wikipedia-data --input-path enwiki-latest-pages-articles.json --overwrite

It seems like the overwrite flag doesn't do much here though. I assume I need to rm -rf wikipedia and start from scratch, correct?

@fmassot
Contributor

fmassot commented Dec 19, 2021

> It seems like the overwrite flag doesn't do much here though. I assume I need to rm -rf wikipedia and start from scratch, correct?

Yes exactly, you need to reindex the dataset, and the simplest way to do it is to create the index again with the --override flag.

@fmassot
Contributor

fmassot commented Dec 19, 2021

Sorry, it's --overwrite, not --override.

@fmassot
Contributor

fmassot commented Dec 19, 2021

I've just tested a query like this on the tutorial Wikipedia dataset:

➜  quickwit git:(main) ✗ cargo r --release index search --index-id wikipedia --query 'body:"Barack Obama"' | jq '.hits[].title[]'
"Speaker of the United States House of Representatives election, October 2015"
"American SAFE Act of 2015"
"Sheila Gwaltney"
"Haben Girma"
"Assistant Secretary of State for Conflict and Stabilization Operations"
"Dixie Highway in Florida"
"Catherine A. Novelli"
"Walter Naegle"
"2015 Xi Jinping visit to the United States"
"Compact of Mayors"

Here is my index config:

version: 0
index_id: wikipedia
index_uri: file:///Users/fmassot/Documents/quickwit/indexes/wikipedia

doc_mapping:
  field_mappings:
    - name: title
      type: text
      tokenizer: default
      record: position
    - name: body
      type: text
      tokenizer: default
      record: position

search_settings:
  default_search_fields: [title, body]

I have 10 hits.

@fmassot
Contributor

fmassot commented Dec 19, 2021

@mrusme I will try your query on the complete dataset.

@mrusme
Author

mrusme commented Dec 19, 2021

Yeah, just finished re-indexing and now my results for the example I gave initially look like this:

"Apollo 11"
"Apollo 11 (disambiguation)"
"Apollo 11 in popular culture"
"Apollo 11 missing tapes"
"Apollo 11 goodwill messages"
"British television Apollo 11 coverage"
"Apollo 11 (1996 film)"
"Apollo 11 lunar sample display"
"Apollo 11 Cave"
"Moonshot: The Flight Of Apollo 11"
"Apollo 11 50th Anniversary commemorative coins"
"Apollo 11 anniversaries"
"Apollo 11 (2019 film)"

It's better now, and I assume for exact matches I could always pick the very first match. However, it would be nice to have an exact exact match, which literally only shows a hit for an entry that's named "Apollo 11", or no hit at all if there is none. :)

@fmassot
Contributor

fmassot commented Dec 19, 2021

I opened a dedicated issue for the search command here #922. It will be fixed soon, thanks again for the report.

> It's better now, and I assume for exact matches I could always pick the very first match. However, it would be nice to have an exact exact match, which literally only shows a hit for an entry that's named "Apollo 11", or no hit at all if there is none. :)

I see, yes. But looking at other search engines and at how tantivy handles queries, there are several ways to do what you want:

  1. use a costly regex query (possible in tantivy but not in quickwit today)
  2. instead of indexing title tokens, you can use the raw tokenizer, which will index the title as one string. Then a query for "Apollo 11" will give you only 1 result, but a query like "Apollo" will not match a document titled "Apollo 11".
  3. just do a top 10 query and filter out what you don't need.

I would go for the 3rd solution :)
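Options 2 and 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `default_tokens` and `raw_tokens` are simplified stand-ins for a word-splitting tokenizer and a raw tokenizer, `exact_title_hits` is a hypothetical helper, and the sample response is assembled by hand in the shape shown earlier in this thread (`numHits`, `hits`, and `title` as an array).

```python
import json
import re


def default_tokens(text):
    # Stand-in for a word-splitting tokenizer (assumption): lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def raw_tokens(text):
    # Stand-in for a raw tokenizer: the whole field value is a single token.
    return [text]


def exact_title_hits(response_json, wanted_title):
    # Option 3: run the phrase query, then keep only hits whose title equals the query.
    response = json.loads(response_json)
    return [hit for hit in response.get("hits", [])
            if wanted_title in hit.get("title", [])]


# Option 2 trade-off: with a raw token, "Apollo" alone no longer matches "Apollo 11".
print("apollo" in default_tokens("Apollo 11"))
print("Apollo" in raw_tokens("Apollo 11"))
print("Apollo 11" in raw_tokens("Apollo 11"))

# Hand-written sample response using titles from this thread.
sample_response = json.dumps({
    "numHits": 3,
    "hits": [
        {"title": ["Apollo 11"]},
        {"title": ["Apollo 11 (disambiguation)"]},
        {"title": ["Apollo 11 missing tapes"]},
    ],
})
print(exact_title_hits(sample_response, "Apollo 11"))
```

The filter returns an empty list when no title matches exactly, which gives the "exact hit or nothing" behaviour without changing the index.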

@fulmicoton
Contributor

We should add BM25 scoring for this use case. That would help.
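As a rough illustration of why BM25 would help here, the sketch below ranks a handful of titles from this thread with the textbook BM25 formula. It makes no claim about how Quickwit or tantivy compute scores: `bm25_rank` is a hypothetical helper, the tokenizer is a simplified stand-in, and `k1=1.2`, `b=0.75` are just commonly used defaults.

```python
import math
import re


def tokenize(text):
    # Simplified stand-in tokenizer (assumption): lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(titles, query, k1=1.2, b=0.75):
    """Return titles sorted by descending textbook BM25 score for the query."""
    docs = [tokenize(title) for title in titles]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs or 1.0
    scored = []
    for title, doc in zip(titles, docs):
        score = 0.0
        for term in tokenize(query):
            tf = doc.count(term)
            if tf == 0:
                continue
            df = sum(1 for other in docs if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalisation: shorter fields are penalised less.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scored.append((score, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


titles = [
    "Apollo program",
    "Apollo 11 in popular culture",
    "Apollo",
    "Apollo 11 (disambiguation)",
    "Apollo 11",
]
ranked = bm25_rank(titles, "Apollo 11")
print(ranked[0])
```

Because every title that contains both terms has the same term frequencies, the length normalisation decides the order, so the shortest such title ("Apollo 11") comes out on top.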

@fmassot
Contributor

fmassot commented Dec 20, 2021

@mrusme the search command bug has been fixed here #923

Let's close this issue and maybe add a dedicated one for the BM25 scoring? @fulmicoton ?

@fmassot fmassot closed this as completed Dec 20, 2021