
Indexes Being Dropped #254

Open
darnfish opened this issue Jun 27, 2021 · 3 comments

Comments

@darnfish

Problem

I'm having an issue where a predictable proportion (around 10%) of the terms I index are being dropped from the search index. I'm not sure if it's a problem with my dataset, but it doesn't seem to be.

This is occurring both in development on Sonic 1.3.0 (installed via Homebrew, running with this config) and in production on Sonic 1.3.0 (pulled from Docker, running on Kubernetes), both connecting from Node.js via sonic-channel@1.2.6.

FYI: I'm trying to add "interests" to the search index.

Test Script

[Screenshot: test script, 2021-06-27 7:59 am]

This script indexes an interest and then immediately drops it. If the removed variable is equal to zero, it means the search term was never properly indexed in the first place.
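In case the screenshot is hard to read, here is a simplified sketch of what the script does (collection/bucket names and the interests array are trimmed for readability, and it relies on pop() resolving with the number of removed entries, which is where the removed value comes from):

```js
// Simplified sketch of the push-then-pop check described above.
// Names and data are illustrative, not the exact values from the real script.
const { Ingest } = require("sonic-channel");

const interests = ["The Last of Us Part II", "Hades", "Celeste"]; // example data

const ingest = new Ingest({
  host: "127.0.0.1",
  port: 1491,
  auth: "SecretPassword"
}).connect({
  connected: async () => {
    for (const [i, interest] of interests.entries()) {
      const object = `interest:${i}`;

      // Index the interest, then immediately pop the same text back out.
      // pop() is expected to resolve with the number of entries removed.
      await ingest.push("interest", "test", object, interest);
      const removed = await ingest.pop("interest", "test", object, interest);

      // If nothing could be popped, the text was never indexed at all
      if (removed === 0) {
        console.log(`dropped: ${interest}`);
      }
    }

    await ingest.close();
  },
  error: (error) => console.error(error)
});
```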

After running on 1,000 interests, the same "The Last of Us Part II" interest (the 504th interest in the array) consistently fails to be indexed:

[Screenshot: output of the 1,000-interest run, 2021-06-27 8:04 am]

After increasing to 10,000 interests, the same 19 interests are consistently dropped from the search index:

[Screenshot: output of the 10,000-interest run, 2021-06-27 8:06 am]

Conclusion

I'm not seeing anything in my logs, and changing the collection ID has no effect either. For "The Last of Us Part II", here are the config params I'm using:

collection: interest
bucket: 2355282563624337408
object: 2451202401969897472

After indexing ~40,000 interests, I'm seeing exactly 4,040 consistently dropped :~(

I'd really appreciate some help with this issue—thanks!

@valeriansaliou (Owner)

Hello,

This may be normal depending on your Sonic configuration. Sonic has a maximum number of words retained per object, configured as retain_word_objects (see: https://github.com/valeriansaliou/sonic/blob/master/config.cfg#L35).
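For reference, the relevant entry in config.cfg looks roughly like this (the exact section and default may differ in your version, so check your own config file):

```toml
# Excerpt from config.cfg: caps the number of words retained per indexed object
[store.kv]
retain_word_objects = 1000
```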

Also, regarding the indexing of The Last of Us Part II, I'm pretty sure the ingestion pipeline only sees stopwords there ("The", "Last", "Of", "Us", "Part"), which results in an empty string that therefore does not get ingested at all.

Can you try temporarily altering this game name and adding a non-stopword word, just to be sure? (Stopwords for English are listed in: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs)

Also, you should probably force English as the detected language in the ingest query, instead of relying on the built-in ngram locale detector, which would pick the wrong language for a lot of those short strings and thus use a different stopwords dictionary that may interfere. If everything you ingest is English-based, then you should force "eng" as the locale (see: https://github.com/valeriansaliou/sonic/blob/master/PROTOCOL.md#4%EF%B8%8F⃣-sonic-channel-ingest-mode).
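As a rough sketch, forcing the locale would look like this (the raw PUSH ... LANG(eng) form is per PROTOCOL.md; whether push() accepts the trailing { lang: "eng" } options object is something to verify against the sonic-channel docs for the version you're running):

```js
// Sketch: force "eng" as the ingestion locale instead of relying on
// auto-detection. The raw channel command (see PROTOCOL.md) is:
//
//   PUSH interest 2355282563624337408 2451202401969897472 "The Last of Us Part II" LANG(eng)
//
// With sonic-channel this would look roughly like the call below; check that
// your installed version supports the trailing { lang } options object.
const { Ingest } = require("sonic-channel");

const ingest = new Ingest({
  host: "127.0.0.1",
  port: 1491,
  auth: "SecretPassword"
}).connect({
  connected: () => {
    ingest
      .push(
        "interest",
        "2355282563624337408",
        "2451202401969897472",
        "The Last of Us Part II",
        { lang: "eng" }
      )
      .then(() => ingest.close())
      .catch((error) => console.error(error));
  },
  error: (error) => console.error(error)
});
```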

@darnfish (Author)

The stopwords seem to be the culprit :~) I don't think the issue is with retain_word_objects, as each interest name is well under 1,000 words.

How would I go about solving the stopword issue now? Is there any way this can be disabled? (I'm not sure how crucial stopwords are to how the library works.)

@valeriansaliou (Owner)

Unfortunately, for now there is no way to disable the stopwords "cleanup" system. I'll add a protocol option, though, to disable it on all requests.
