Full text index does not work as expected #4837

Closed
cangfengzhs opened this issue Nov 8, 2022 · 1 comment
@cangfengzhs
Contributor

How do we use ES for fulltext search?

In NebulaGraph, according to our documentation, you must first create a native index before you can use a full-text index. This is counterintuitive. The reason is that the full-text index works in the following way:

data in ES

  • docID: partitionID, schemaID, column_name, and the first 256 bytes of text
  • value: the first 256 bytes of text

The docID is the identifier that uniquely identifies a document in Elasticsearch, and its length is limited to 512 bytes. This layout may look puzzling at first, so I will explain why it is built this way and what problems it causes.
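For concreteness, here is a minimal sketch of how such a docID and value could be assembled. The helper names, the separator, and the exact field encoding are assumptions for illustration, not the actual storaged/listener code.

```python
MAX_TEXT_LEN = 256  # only the first 256 bytes of the text are kept

def make_doc_id(partition_id: int, schema_id: int, column_name: str, text: str) -> bytes:
    # ES limits _id to 512 bytes, which is why the text portion is truncated
    prefix = f"{partition_id}_{schema_id}_{column_name}_".encode("utf-8")
    return prefix + text.encode("utf-8")[:MAX_TEXT_LEN]  # naive byte slice

def make_value(text: str) -> bytes:
    # value is the same truncated prefix of the text
    return text.encode("utf-8")[:MAX_TEXT_LEN]
```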

write

We limit the maximum length of the text to 256 bytes. For convenience, let's temporarily assume the limit is only 3, so the "256" in the docID and value above also becomes "3".

insert vertex t (name) values 1: ("abcd")

Let's take the above statement as an example. Assume a native index index_t_name(3) and the full-text index es_ft_t_name are created on tag property t.name.

When NebulaGraph (storaged) writes this vertex, the corresponding raft listener also writes the data to ES. Because the maximum length limit is 3 (as we assumed), "abcd" is truncated to "abc" before it is written to ES.

read

lookup on t where prefix(t.name, "ab")

Now we query with the above statement. When NebulaGraph (graphd) processes the expression prefix(t.name, "ab"), it recognizes it as a full-text index expression and sends a query to ES: find the data prefixed with "ab". ES returns "abc" (which we wrote earlier), and graphd rewrites prefix(t.name, "ab") to t.name == "abc".

lookup on t where t.name=="abc"

The lookup is then executed with this rewritten statement. However, there is no data with t.name == "abc"; we only have t.name == "abcd". So no data is found.
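The read path can be sketched roughly as follows. The field name "value" follows the document layout above; how graphd actually builds the request may differ.

```python
# an Elasticsearch prefix query against the truncated "value" field
es_query = {"query": {"prefix": {"value": "ab"}}}

# with the assumed limit of 3, ES only holds "abc" (truncated from "abcd")
es_hits = ["abc"]

# graphd rewrites prefix(t.name, "ab") into an equality filter per hit
rewritten = " OR ".join(f't.name == "{hit}"' for hit in es_hits)
print(rewritten)  # t.name == "abc"
# the stored property is still "abcd", so the rewritten lookup matches nothing
```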

problems

utf8

We truncate the string directly at the 256-byte limit. For Chinese or other multi-byte characters, this can easily leave an incomplete UTF-8 character at the end, and writing such data to ES fails with an error.
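A small sketch of the failure mode, using a 4-byte limit purely for illustration:

```python
text = "图数据库"             # each character takes 3 bytes in UTF-8
raw = text.encode("utf-8")    # 12 bytes in total
broken = raw[:4]              # a 4-byte limit cuts the second character in half
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8 after truncation:", err)
# ES rejects a write whose value is not valid UTF-8
```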

ES cannot retrieve the data as expected

If the characters we want to match appear after the first 256 bytes, they are never written to ES, so ES cannot search them.

If a string is longer than 256, it will never be found

As in the example above.

solutions

Here are three solutions.

Remove the maximum length limit on value

This is the simplest fix, and I have already done it in #4836. It is no worse than before, but it introduces new problems.

Back to our example. If value has no length limit, the example above works correctly. However, suppose we insert "abcd" and then insert "abce". When we run the lookup statement, we only get "abce": because the docIDs of "abcd" and "abce" are identical, writing "abce" overwrites the "abcd" document in ES.

We cannot remove the length limit of docID, because ES does not allow it to exceed 512 bytes.
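The collision can be seen directly from the docID layout; this reuses the hypothetical make_doc_id helper from above with the assumed 3-byte limit.

```python
def make_doc_id(partition_id, schema_id, column_name, text, limit=3):
    # only the first `limit` bytes of the text go into the docID
    return f"{partition_id}_{schema_id}_{column_name}_{text[:limit]}"

id1 = make_doc_id(1, 2, "name", "abcd")
id2 = make_doc_id(1, 2, "name", "abce")
print(id1 == id2)  # True: writing "abce" overwrites the "abcd" document in ES
```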

refactor docID

Stop putting the first 256 bytes of the value into the docID. Instead, write the vid (for a tag) or {src, dst, rank} (for an edge). Other logic is left unchanged for now, and a native index is still required, but on the premise of correctness this change is minimal.
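A minimal sketch of what such a docID could look like; the field order and separators are assumptions, not a final design.

```python
def tag_doc_id(partition_id: int, tag_id: int, column_name: str, vid: str) -> str:
    return f"{partition_id}_{tag_id}_{column_name}_{vid}"

def edge_doc_id(partition_id: int, edge_type: int, column_name: str,
                src: str, dst: str, rank: int) -> str:
    return f"{partition_id}_{edge_type}_{column_name}_{src}_{dst}_{rank}"

# the docID no longer depends on the text, so two different long values can no
# longer collide, and the docID stays well under ES's 512-byte limit
```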

The open question is that I could not find any logic that deletes the corresponding data in ES when a tag/edge is deleted. This needs further confirmation.

refactor fulltext index (best but hard)

We rebuild the whole full-text indexing logic. The native index is no longer required: record the vid (or src, dst, rank) of the vertex (edge) that owns the text directly in ES, and then fetch the vertices (edges) by vid (or src, dst, rank) from the search results.
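A rough sketch of the refactored document and query; the index and field names are assumptions, not a final design.

```python
# the document stores the full text together with the owning vertex/edge id
doc = {
    "vid": "1",          # or {"src": ..., "dst": ..., "rank": ...} for an edge
    "column": "name",
    "text": "abcd",      # the full, untruncated text
}

query = {"query": {"prefix": {"text": "ab"}}}
# the hits carry vid (or src/dst/rank), so graphd can fetch the matching
# vertices (edges) directly instead of rewriting the lookup expression
```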

This may take a lot of time.

@cangfengzhs cangfengzhs added type/bug Type: something is unexpected priority/hi-pri Priority: high labels Nov 8, 2022
@Sophie-Xie Sophie-Xie added this to the v3.4.0 milestone Nov 8, 2022
@xtcyclist xtcyclist added the type/bug/functionality Bugs preventing the database to deliver a promised function. label Nov 9, 2022
@jinyingsunny jinyingsunny added the severity/major Severity of bug label Nov 10, 2022
@HarrisChu HarrisChu added the affects/none PR/issue: this bug affects none version. label Dec 1, 2022
@cangfengzhs
Contributor Author

fixed #5077

@github-actions github-actions bot added the process/fixed Process of bug label Dec 27, 2022
@Hester-Gu Hester-Gu added the process/done Process of bug label Jan 13, 2023
@github-actions github-actions bot removed the process/fixed Process of bug label Jan 13, 2023