Unexpected tokenisation for string datatype #2338

ju-bezdek · 2022-11-02T11:12:47Z

Following this Slack thread:
https://weaviate.slack.com/archives/C017EG2SL3H/p1667377648575099

it turned out that the default tokenisation for the string datatype is word. This is very confusing, since in every programming language comparison operators on a string field have a well-understood, expected outcome.

Moreover, it is not consistent with the docs. Here:
https://weaviate.io/developers/weaviate/current/schema/datatypes.html#datatype-string-vs-text

it is stated:

DataType: string vs. text
There are two datatypes dedicated to saving textual information: string and text. string values are indexed as one token, whereas text values are indexed after applying tokenization.

This is not true according to @etiennedi.

Proposed solution:

the default tokenisation for string should be field
the docs should be made clearer about this

citing @etiennedi:

The main difference between text/string is whether non-alphanumeric characters are indexed or not.
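
To make the difference concrete, here is a toy sketch in Python. It is not Weaviate's implementation: the function names are hypothetical, and the exact splitting and lowercasing rules are assumptions based only on the descriptions in this thread. It shows why an equality-style filter behaves differently under field tokenisation (whole value as one token) and word tokenisation (value split into several tokens).

```python
import re
from typing import Callable, List


def tokenize_field(value: str) -> List[str]:
    """Treat the whole value as a single token (outer whitespace trimmed)."""
    return [value.strip()]


def tokenize_word_keep_symbols(value: str) -> List[str]:
    """Assumed string-like 'word' mode: split on whitespace only,
    so non-alphanumeric characters stay inside the tokens."""
    return value.split()


def tokenize_word_alnum_only(value: str) -> List[str]:
    """Assumed text-like 'word' mode: split on every non-alphanumeric
    character and lowercase the resulting tokens."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def equal_matches(query: str, stored: str,
                  tokenize: Callable[[str], List[str]]) -> bool:
    """Toy 'Equal' filter: true if every query token is among the stored tokens."""
    stored_tokens = set(tokenize(stored))
    return all(tok in stored_tokens for tok in tokenize(query))


stored = "user-42 Admin"

print(tokenize_field(stored))
print(tokenize_word_keep_symbols(stored))
print(tokenize_word_alnum_only(stored))

# With field tokenisation only the exact value matches;
# with word tokenisation a single word of the value matches too.
print(equal_matches("Admin", stored, tokenize_field))
print(equal_matches("Admin", stored, tokenize_word_keep_symbols))
```

Under these assumptions, a filter for "Admin" does not match the stored value when it is indexed as one field token, but does match once the value is split into words, which is the surprise described above.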

etiennedi · 2023-03-21T12:58:45Z

This should become obsolete with the changes outlined in #2751. Do you agree? Then I'll close this one.

ju-bezdek · 2023-03-21T13:05:32Z

Yes, if it has already been decided to deprecate text, then this issue is irrelevant.
Although having the string datatype as pure value storage makes sense to me, I can see that #2751 has already been planned, so this can be closed.

etiennedi · 2023-03-21T13:08:53Z

Although having the string datatype as pure value storage makes sense to me

Yes, we will keep the option of turning off any indexing (and, therefore, any tokenization).

Thanks, closing this one.

etiennedi added the backlog label Nov 2, 2022

etiennedi closed this as not planned Mar 21, 2023
