Unexpected tokenisation for string datatype #2338

Closed
ju-bezdek opened this issue Nov 2, 2022 · 3 comments

@ju-bezdek

As a result of this thread:
https://weaviate.slack.com/archives/C017EG2SL3H/p1667377648575099

it was discovered that the default tokenisation for string is word, which is very confusing, since in every programming language comparison operators on a string field have a predictable, expected outcome.

Moreover, this is not consistent with the docs. Here:
https://weaviate.io/developers/weaviate/current/schema/datatypes.html#datatype-string-vs-text

it is stated:

DataType: string vs. text
There are two datatypes dedicated to saving textual information: string and text. string values are indexed as one token, whereas text values are indexed after applying tokenization.

Which is not true, according to @etiennedi.

Proposed solution:

  • the default tokenisation for string should be field
  • the docs should be made clearer about this
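
Until the default changes, the behaviour can be requested per property. The following is a minimal sketch of such a class definition, assuming the `tokenization` property setting that the schema accepted for string properties at the time of this issue (`word` or `field`). The class name `Product` and the property name `sku` are made up for illustration.

```python
import json

# Hypothetical class definition; "Product" and "sku" are illustrative names only.
product_class = {
    "class": "Product",
    "properties": [
        {
            "name": "sku",
            "dataType": ["string"],
            # Assumption: overrides the default ("word") so that the whole
            # value is indexed as a single token.
            "tokenization": "field",
        }
    ],
}

# Print the JSON body that would be sent when creating the class.
print(json.dumps(product_class, indent=2))
```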

citing @etiennedi:

 The main difference between text/string is whether non-alphanumeric characters are indexed or not.
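
To make the difference concrete, here is a small Python sketch of the three behaviours discussed in this issue. It is not Weaviate's implementation: the function names are hypothetical, and the rules are assumptions based on the discussion above (string with word splits on whitespace only, string with field keeps the whole value, text with word lowercases and keeps only alphanumeric runs). The matching rule, where every query token must be present in the stored tokens, is also an assumption.

```python
import re

def tokenize_string_word(value):
    # Assumed: string + "word" splits on whitespace only,
    # keeping case and non-alphanumeric characters.
    return value.split()

def tokenize_string_field(value):
    # Assumed: string + "field" indexes the whole value as one token.
    return [value]

def tokenize_text_word(value):
    # Assumed: text + "word" lowercases and keeps only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", value.lower())

def equal_matches(query, stored, tokenizer):
    # Assumed filter semantics: every query token must appear
    # among the tokens of the stored value.
    stored_tokens = set(tokenizer(stored))
    return all(token in stored_tokens for token in tokenizer(query))

stored = "E-123 rev.B"
query = "E-123"

print(tokenize_string_word(stored))   # ['E-123', 'rev.B']
print(tokenize_string_field(stored))  # ['E-123 rev.B']
print(tokenize_text_word(stored))     # ['e', '123', 'rev', 'b']

# With "word", an Equal filter on part of the value matches;
# with "field", only the full value does.
print(equal_matches(query, stored, tokenize_string_word))   # True
print(equal_matches(query, stored, tokenize_string_field))  # False
```

Under these assumptions, the surprise described above is the first `True`: an equality comparison succeeds even though the stored value and the query are different strings.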

@etiennedi
Member

This should become obsolete with the changes outlined in #2751. Do you agree? Then I'll close this one.

@ju-bezdek
Author

Yes, if it has already been decided to deprecate text, then this issue is irrelevant.
Although having string datatype as a pure value storage makes sense to me, I can see that #2751 has already been planned, so this can be closed.

@etiennedi
Member

Although having string datatype as a pure value storage makes sense to me

Yes, we will keep the option of turning off any indexing (and, therefore any tokenization).

Thanks, closing this one.
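
For readers looking for the "pure value storage" option mentioned above, the sketch below shows what switching off indexing for a single property might look like. It assumes the per-property `indexInverted` schema setting available at the time of this issue; later releases may name or split this option differently. The property name `rawPayload` is hypothetical.

```python
# Hypothetical property definition; "rawPayload" is an illustrative name.
raw_payload_property = {
    "name": "rawPayload",
    "dataType": ["string"],
    # Assumption: with the inverted index switched off, the value is stored
    # but not tokenized, so it cannot be used in filters.
    "indexInverted": False,
}

print(raw_payload_property)
```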

@etiennedi closed this as not planned on Mar 21, 2023