Releases: weaviate/weaviate
0.22.7 - Improved Contextual Classification & Bugfixes
Docker image/tag: semitechnologies/weaviate:0.22.7
See also: example docker compose files in English, German, Dutch, Italian and Czech.
Breaking Changes
New Features
-
Improved Contextual classification algorithm (#1125)
Prior to this release, a contextual classification would often yield false positives for whichever label was closest to the "noise center". This means we would overweigh filler and stop words and not pay enough attention to the most important words.
As we compare a data object to its label in a contextual classification, rather than data to other data as in a kNN-type classification, this issue was far more prevalent in a contextual classification than in one of type kNN. In the latter the noise would be present among all data objects, so it was likely to be cancelled out. However, in data objects with (long) texts the contextual classification suffered.
This release introduces a complete rewrite of the classification algorithm. Instead of weighing each word purely on its occurrence in the Contextionary, we now weigh (and even remove) words based on two new metrics: Information Gain and tf-idf.
Information Gain is a custom measure to predict how likely a given word is to influence the classification towards a specific target (label). For example, imagine the data object "I love my new computer" with the possible labels "Technology", "Food", "Politics". When looking at each word in the source object, Weaviate would identify "computer" as the word with the highest information gain, as it would clearly move the vector towards one of the categories ("computers"). The other words might point to any of the categories without a clear favorite, so their information gain should be lower. As a result, Weaviate will weigh "computer" the highest in the data object.
Tf-idf, on the other hand, does not compare the data objects directly to a target (label), but rather to other objects. If multiple objects exist, such as "My new computer is great!", "Who is the new president?", "New dishes on the menu!", the word "new" is identified to occur in every object and thus has an Inverse Document Frequency of 0. Based on user configuration this word can be removed from vectorization entirely.
The new mechanisms are user-configurable. They come with reasonable defaults that will work for many datasets, but to get the most out of your classification, it might make sense to tweak them until you get the best possible results. For a detailed list and explanation of the newly introduced parameters, see this comment.
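To make the tf-idf example concrete, here is a minimal Python sketch that computes the Inverse Document Frequency for the three objects above. It assumes the common definition idf = log(N / df) and a simple lowercase word tokenizer; the exact variant used inside Weaviate is not stated in these notes, and the function name is purely illustrative.

```python
import math
import re


def inverse_document_frequency(documents):
    """Return {word: idf} with idf = log(N / df) (assumed textbook definition)."""
    # Lowercase each document and reduce it to its set of distinct words.
    tokenized = [set(re.findall(r"\w+", doc.lower())) for doc in documents]
    n = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n / sum(word in doc for doc in tokenized))
        for word in vocabulary
    }


documents = [
    "My new computer is great!",
    "Who is the new president?",
    "New dishes on the menu!",
]
idf = inverse_document_frequency(documents)
print(idf["new"])       # 0.0, because "new" occurs in every object
print(idf["computer"])  # log(3), because "computer" occurs in only one object
```

A word with an IDF of 0 carries no information for telling objects apart, which is why it is a candidate for removal.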
Benchmark
In a benchmark based on the 20 news group data set we have seen a substantial improvement in success rates:
Note that this benchmark was done using a contextual classification, i.e. without training data (labeled data). The success rates are therefore not comparable to other mechanisms which rely on training data. If you want to compare Weaviate's performance with other classification mechanisms which require labeled data, please run a kNN classification instead.
- Main Category: The posts were to be categorized as one of 6 categories (expected success rate for random distribution ~16.7%)
- Granular Category: The posts were to be categorized as one of 20 categories (expected success rate for random distribution ~5%)
| Goal | Previous (<0.22.7) | Improved Algorithm (>= 0.22.7) |
| --- | --- | --- |
| Main Category | 18% | 58% |
| Granular Category | 10% | 42% |
The following settings were used:
    # dataset
    n: 563 # randomly picked with a roughly equal size per category
    # configuration
    type: contextual
    informationGainCutoffPercentile: 10
    informationGainMaximumBoost: 3
    tfidfCutoffPercentile: 80
Fixes
- Fix unexpected behavior on geoCoordinates 0,0 (#825)
GeoCoordinates of 0,0 - infamously known as Null Island - would lead to the geoCoordinates property disappearing entirely, as 0 also happens to be the null/initial value for a property of type float in Golang. This release fixes this and we explicitly display a 0-Coordinate as such now.
0.22.6 - Filter objects by count of references
Docker image/tag: semitechnologies/weaviate:0.22.6
See also: example docker compose files in English, German, Dutch, Italian and Czech.
Breaking Changes
none
New Features
-
Filter objects by count of references (#1101)
Weaviate has already offered substantial "filter by references" capabilities in the past, such as "Find all Cities located in a Country with a population size larger than x". However, prior to this release it was not possible to filter for cases such as "Show all Cities not associated with a Country" or "Find all authors who wrote at least 2 articles".
This release adds the ability to filter by reference count. To do so, simply provide one of the existing compare operators (Equal, LessThan, LessThanEqual, GreaterThan, GreaterThanEqual) and use it directly on the reference element. For example, the following GraphQL query finds all authors who wrote at least 2 articles:
    {
      Get {
        Things {
          Author(
            where: {
              valueInt: 2
              operator: GreaterThanEqual
              path: ["WroteArticles"]
            }
          ) {
            name
            WroteArticles {
              ... on Article {
                title
              }
            }
          }
        }
      }
    }
Note: The example above uses the News Publication dataset.
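The semantics of a reference-count filter can be sketched in a few lines of Python. This is an in-memory illustration only, not how Weaviate evaluates the filter: the function name and the sample authors are hypothetical, while the operator names and the where-filter fields are the ones used in the query above.

```python
import operator

# Operator names as listed above, mapped to Python comparisons.
OPERATORS = {
    "Equal": operator.eq,
    "LessThan": operator.lt,
    "LessThanEqual": operator.le,
    "GreaterThan": operator.gt,
    "GreaterThanEqual": operator.ge,
}


def filter_by_reference_count(objects, where):
    """Keep objects whose reference property holds a matching number of references."""
    prop = where["path"][0]
    compare = OPERATORS[where["operator"]]
    return [o for o in objects if compare(len(o.get(prop, [])), where["valueInt"])]


# Hypothetical sample data, not taken from the News Publication dataset.
authors = [
    {"name": "Author A", "WroteArticles": ["article-1", "article-2"]},
    {"name": "Author B", "WroteArticles": ["article-3"]},
    {"name": "Author C", "WroteArticles": []},
]

# "Find all authors who wrote at least 2 articles"
prolific = filter_by_reference_count(
    authors,
    {"valueInt": 2, "operator": "GreaterThanEqual", "path": ["WroteArticles"]},
)
# "Show all authors not associated with any article"
unlinked = filter_by_reference_count(
    authors,
    {"valueInt": 0, "operator": "Equal", "path": ["WroteArticles"]},
)
print([a["name"] for a in prolific])  # ['Author A']
print([a["name"] for a in unlinked])  # ['Author C']
```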
Fixes
none
0.22.5 - More hypertext references in API & Important Contextionary Fix
Docker image/tag: semitechnologies/weaviate:0.22.5
See also: example docker compose files in English, German, Dutch, Italian and Czech.
Breaking Changes
none
New Features
-
Hypertext Links on API root (#1108, #1103)
Prior to this, accessing the path / would return 404 Not Found. This was changed as follows:
- / redirects (301 Moved Permanently) to /v1, which is the API base. If the client does not automatically follow redirects, a JSON is presented which contains the link to /v1.
- /v1 shows a list of main APIs and links to documentation for each resource group. Note this is not a complete list, as the intention is not to list every possible option (we have the swagger document for this). Instead the links work like website links, where on the root page you are presented with a few main categories.
- If the origin option is configured in the Weaviate config, an absolute URI is used. This can be helpful when Weaviate is running behind a reverse proxy (which is most likely the case in a production setting). Then Weaviate has no way of knowing how the user accesses it without it being explicitly configured. If the origin config is not set, links do not default to the listen/bind address as origin; instead, relative links are presented.
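The absolute-versus-relative link behavior described above can be sketched as follows. This is a hedged illustration of the rule only; the helper name and the example origin are hypothetical and not taken from Weaviate's code.

```python
def build_link(path, origin=None):
    """Return an absolute URI if an origin is configured, otherwise a relative link."""
    if origin:
        # Avoid a double slash when the configured origin ends with "/".
        return origin.rstrip("/") + path
    return path


print(build_link("/v1"))                                         # /v1
print(build_link("/v1", origin="https://weaviate.example.com"))  # https://weaviate.example.com/v1
```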
-
Hypertext Links cross-references (#1106)
Similar to the API root links, all REST endpoints which can show cross-references now include a read-only field href alongside the existing beacon field. The field contains an HTTP Hypertext Reference to the respective resource. The same behavior regarding origin in the config and absolute vs. relative URIs as outlined above applies to these links as well.
Fixes
-
Memory leak fixed in contextionary (weaviate/contextionary#25)
We discovered a potential memory leak in a library used in the contextionary. In some cases after long import sessions the contextionary memory usage would keep growing without a limit. We have replaced the code from the external library with custom code in semi-technologies#26, thus fixing the issue.
The docker-compose files linked above already reference the new version. If you are running your own setup or the K8s setup via the official helm chart, make sure you reference version <language>0.14.0-v0.4.8 or higher.
0.22.4 - New contextionary languages added
Docker image/tag: semitechnologies/weaviate:0.22.4
See also: example docker compose files in English, German, Dutch, Italian and Czech.
Breaking Changes
none
New Features
- New contextionary languages added for contextionary version xx0.13.0-v0.4.7. See the links above for example docker-compose files for supported languages. You can use the linked contextionary images in other setups (Kubernetes, Helm) as well.
Fixes
none
0.22.3 - Bugfixes and Vector as part of Object's Meta
Docker image/tag: semitechnologies/weaviate:0.22.3
See also: example docker compose files in English and Dutch.
Breaking Changes
none
New Features
-
Return objects' vector position when meta=true (#1041)
As part of the classification feature, a meta option (passed as a query parameter) was added to the GET /v1/things and GET /v1/actions API. If the object was part of a classification, meta information about that classification is printed. Additionally, the meta object will now, regardless of classifications, also contain the object's vector position.
Keep in mind that the 600-dimensional vector is about 2.4KB in size in the underlying storage and about twice that size when encoded as float numbers in JSON. So you will add about 5KB of data per object when setting meta=true. While this is negligible on single objects, the additional data to be transferred on long list queries might add up to a lot of additional traffic. So, only set this option if really necessary.
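The storage figure above can be reproduced with simple arithmetic. The sketch below assumes 32-bit floats (4 bytes each), which is consistent with the 2.4KB figure but not stated explicitly in these notes; the JSON size depends on how many digits are printed per number, so the random vector here only shows that the encoded form is noticeably larger than the raw form.

```python
import json
import random

DIMENSIONS = 600
BYTES_PER_FLOAT32 = 4  # assumption: 32-bit floats, consistent with the 2.4KB figure

raw_bytes = DIMENSIONS * BYTES_PER_FLOAT32
print(raw_bytes)  # 2400, i.e. about 2.4KB

# Encode a random stand-in vector as JSON to see the text overhead.
random.seed(0)
vector = [round(random.uniform(-1, 1), 6) for _ in range(DIMENSIONS)]
json_bytes = len(json.dumps(vector).encode("utf-8"))
print(json_bytes > raw_bytes)  # True: the JSON text is larger than the raw floats
```

On a list query returning many objects, this per-object overhead is multiplied by the number of results.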
Fixes
-
Bug: ?meta=true ignored on list queries (#1099)
Prior to this release, setting the meta=true query param worked on GET /v1/things/{id} (single object), but not on GET /v1/things (list of objects). This release fixes this and makes sure meta=true can now be set on both types of GET queries.
-
Bug: Numbers and other characters lead to error in /c11y/concepts endpoint (#1078)
The requirements for class names and other schema fields have been loosened in the past. As of now any utf-8 letter or digit is an acceptable character. However, the /c11y/concepts endpoint, which can be used to inspect word concepts in the contextionary space, still validated against a strict [A-Za-z]. This has been changed and now all utf-8 letters and digits are acceptable.
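The difference between the old strict validation and the loosened one can be illustrated with two regular expressions. This is a Python approximation, not the Go validation Weaviate actually uses: the relaxed pattern relies on Python's Unicode-aware `\w` minus the underscore, which roughly corresponds to "any letter or digit".

```python
import re

# Old behavior: ASCII letters only.
STRICT = re.compile(r"[A-Za-z]+")
# Loosened behavior (approximation): Unicode word characters except underscore.
RELAXED = re.compile(r"[^\W_]+")


def is_valid(pattern, concept):
    """Return True if the whole concept string matches the pattern."""
    return pattern.fullmatch(concept) is not None


for concept in ["Car", "Car2", "Straße"]:
    print(concept, is_valid(STRICT, concept), is_valid(RELAXED, concept))
# Car True True
# Car2 False True
# Straße False True
```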
0.22.2 - Parse and Normalize Phone Numbers
Docker image/tag: semitechnologies/weaviate:0.22.2
See also: example docker compose files in English and Dutch.
Breaking Changes
none
New Features
-
Upgrade to Go 1.14 (#1090)
No user-facing changes. Even for contributors it's very unlikely that this update introduced any changes. But we recommend updating your Go environment to the latest version if you plan on contributing to Weaviate. Thanks.
-
New Data Type: phoneNumber (#1088 and #1087)
A new data type with the name phoneNumber was added. This type is a primitive type like text, string, etc., as opposed to a reference type. Similar to the existing type geoProperties, the new type contains more than a single field.
The full type definition can be seen in the swagger.json definition.
Usage
There are two user-settable sub-fields (input and defaultCountry). input must always be set when using the type; defaultCountry only needs to be set in specific situations:
- When you enter an international number (e.g. +49 171 1234567), no defaultCountry needs to be entered, as the underlying parser will recognize that the above is a German number due to the +49 prefix.
- When you enter the same number as above in a national format (e.g. 0171 1234567), you need to specify the defaultCountry (in this case: "de"), so that the parser can correctly convert the number into all formats.
Inputs and Formats
- phoneNumber.input is of type string. You can enter any phone number. Optional digits, such as an optional 0 (e.g. +49 (0) 171 ....), will be automatically recognized and normalized. Furthermore, all formatting helpers, such as dashes or spaces, are removed by the parser.
- phoneNumber.defaultCountry is of type string. See "Usage" above on when this optional field is required. Content should be entered as ISO 3166-1 alpha-2 country codes.
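As a rough illustration of the two input styles and of what "removing formatting helpers" means, consider the Python sketch below. The payload fields input and defaultCountry come from the description above; the strip_formatting helper is hypothetical and far simpler than a real phone number parser, which also applies country-specific rules.

```python
import re

# The two user-settable sub-fields, for an international and a national number.
international = {"input": "+49 171 1234567"}
national = {"input": "0171 1234567", "defaultCountry": "de"}


def strip_formatting(raw):
    """Drop an optional "(0)" and formatting helpers such as spaces and dashes.

    Simplified stand-in for illustration only, not Weaviate's actual parser.
    """
    without_optional_zero = re.sub(r"\(\s*0\s*\)", "", raw)
    return re.sub(r"[\s\-./()]", "", without_optional_zero)


print(strip_formatting("+49 (0) 171 1234567"))  # +491711234567
print(strip_formatting("0171-123 4567"))        # 01711234567
```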
Read-only fields after parsing
When reading back a field of type phone number, the following (read-only) fields appear:
- internationalFormatted (string): Phone number in international format, e.g. "+49 171 123456"
- national (unsigned integer): National part of the phone number, e.g. 171123456
- nationalFormatted (string): Phone number in national format, e.g. "0171 123456"
- countryCode (unsigned integer): Country-code digits, e.g. 49
- valid (boolean): Whether the parser recognized the phone number as valid
- input (string): The raw phone number as put in by the user (helpful for debugging purposes), see Usage above
- defaultCountry (string): The default country as put in by the user, only set if explicitly set by the user, see Usage above
Limitations
The following phone-related features are not yet part of the above release
- Search by phone numbers (#1089)
- Aggregate phone numbers
Fixes
none
0.22.1 - Influence Weights in Vector Creation
Docker image/tag: semitechnologies/weaviate:0.22.1
See also: example docker compose files in English and Dutch.
Breaking Changes
none
New Features
-
Override weights on vector creation (#1070 and #1074)
Prior to this release the weight of each individual word when creating a vector from an object was out of the user's control. The contextionary uses an algorithm based on the general occurrence of the word in its training data to suggest how each word should be weighted. The underlying assumption is that a rare word should take precedence over a very common word, similar to tf-idf.
This works well in most cases, but in some use-case specific domain languages common words get a new meaning and therefore their importance should change. Imagine the words "far" and "near". They are quite common in overall language, so - especially when mixed with rarer words - they wouldn't get a great weight. However, now assume you're in the domain of optometry or manufacturing glasses. In the terms "far-sighted" and "near-sighted", the words "near" and "far" make a very important distinction. Imagine you were trying to classify objects based on those terms. With the changes in 0.22.1 you can now influence - or even completely override - the weights of individual words when creating vectors.
To do so, the field vectorWeights was introduced to the Thing and Action objects. The field is a key-value map where both the keys and the values must be strings. The keys are the words you want to influence and the value is a mathematical expression to set the new weight. You can use additions, subtractions, multiplications, divisions or simply overwrite the weight with a fixed number. To reference the original weights, use the single-letter variable w. Some examples:
- "vectorWeights": {"far": "10 * w"}
  Give the word "far" 10 times its original weight.
- "vectorWeights": {"far": "w + 0.5", "near": "w - 0.5"}
  Give the word "far" an absolute boost of 0.5, while penalizing the word "near" by 0.5.
- "vectorWeights": {"sighted": "0.7", "glasses": "2 - 4 * w"}
  Let the word "sighted" have a fixed weight of 0.7, whereas the weight of the word "glasses" is calculated by subtracting 4 times the original weight from the number 2.
Some important things to note:
- For this feature to work you need a contextionary version of at least ...v0.4.7. The example docker-compose files linked above have already been updated to the required version.
- Spaces in math expressions have no meaning.
- A word that is not referenced in "vectorWeights" will simply use its original weight as returned by the contextionary.
- Custom vectorWeights only affect the object which they are set on; there is no option to globally manipulate a specific word. If the same vectorWeights are required for multiple objects, simply attach them to all objects where needed.
- Whenever the mathematical expression is not a fixed number (such as "17"), an operator must be present. It is not valid to use implicit operators, such as "2w", which would mean "two times the original weight". In this case explicitly use the multiplication operator, e.g. "2 * w" or "w*2".
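The expression rules above can be mimicked with a small evaluator. The following Python sketch is not the contextionary's implementation; it is a minimal stand-in, with a hypothetical function name, that supports fixed numbers, the variable w and the four basic operators, and rejects implicit multiplication such as "2w".

```python
import ast
import operator

_BINARY_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_weight(expression, w):
    """Evaluate an expression such as "2 - 4 * w" for the original weight w."""

    def visit(node):
        # Fixed numbers (booleans are excluded on purpose).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        # The single-letter variable w refers to the original weight.
        if isinstance(node, ast.Name) and node.id == "w":
            return w
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPERATORS:
            return _BINARY_OPERATORS[type(node.op)](visit(node.left), visit(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    try:
        tree = ast.parse(expression.strip(), mode="eval")
    except SyntaxError as err:
        # Implicit operators such as "2w" end up here.
        raise ValueError(f"invalid expression: {expression!r}") from err
    return visit(tree.body)


original = 0.5  # hypothetical original weight returned by the contextionary
print(evaluate_weight("10 * w", original))     # 5.0
print(evaluate_weight("w + 0.5", original))    # 1.0
print(evaluate_weight("0.7", original))        # 0.7
print(evaluate_weight("2 - 4 * w", original))  # 0.0
```

Parsing into a syntax tree instead of calling eval keeps the sketch limited to the operations the notes describe.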
Full example
Here's a full example for importing a thing object:
POST /v1/things
    {
      "class": "Glasses",
      "schema": {
        "description": "These glasses are meant for far-sighted people"
      },
      "vectorWeights": {
        "far": "5 * w",
        "near": "5 * w"
      }
    }
The above example will boost the words "far" or "near" by a factor of 5. Note that the object does not contain the word "near", so only the word "far" is boosted. The other unreferenced words maintain their original weights.
Fixes
none
0.22.0 - Updated Cross-Reference Storage Strategy
Docker image/tag: semitechnologies/weaviate:0.22.0
See also: example docker compose files in English and Dutch.
Contains Breaking Change!
Note: While this release contains no API-level breaking changes, the internals have changed so much that we recommend not to simply replace your existing Weaviate container with the new one. Instead you should create a new cluster and reimport your things and actions. See the changelog below for more detailed reasons why.
Breaking Changes
-
Improve cross-reference storing strategy (#1069)
Prior to this release Weaviate would build an automated cache of referenced objects. This led to very fast response times for nested queries, at the cost of large disk usage. We have since learned that disk usage can be so excessive in heavily connected graphs that the benefits don't outweigh the costs. In addition, configuring cache boundaries led to unnecessary complexity.
The major goal of 0.22.0 was to replace automated denormalization caching with a smarter strategy without losing the snappiness of cached results and the overall low latencies of queries our users have come to appreciate.
We believe we have found a good strategy with this release, by implementing smarter query strategies to keep inter-container traffic to a minimum and use our backing storage in a way it performs well.
This boils down to the following advantages that 0.22.0 provides over 0.21.x:
- Feature parity: No feature got lost through the rewrite. If it worked with 0.21.x, it works with 0.22.x. If you think otherwise, please open an issue.
- Much smaller disk footprint: Since we don't excessively denormalize references anymore, the disk footprint got much smaller. Essentially the size on disk is now (object size + vector size + index overheads) * desiredReplication. The number of cross-references no longer has a direct impact on disk space (other than storing the link itself, which is effectively the size of the bytes in a weaviate://... beacon).
- No depth limit on nested filters: Prior to this release a filter on a cross-ref prop, such as path: ["inCity", "City", "inCountry", "Country", "name"], had a limit. It would only work within a cache boundary. This limitation is now gone and you can filter as deep as you like. Please note that an excessively deep query will have a performance impact.
- Smaller CPU impact during imports: Prior to this release we'd spend a share of the available resources on building a denormalized cache asynchronously after importing a connected object. Without having to build such a cache, more performance on imports is available for storing, vectorizing and indexing objects.
Please note that caching was previously done at import time. We recommend not to try to upgrade a 0.21.x cluster, but instead creating a new cluster and reimporting. This is the only way to guarantee your cluster won't have cache leftovers which can impact performance.
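The disk footprint formula from the list above can be turned into a quick estimate. The Python sketch below uses made-up byte counts purely for illustration; only the shape of the formula, (object size + vector size + index overheads) * desiredReplication, comes from these notes.

```python
def estimated_disk_bytes(object_bytes, vector_bytes, index_overhead_bytes, desired_replication):
    """Estimate on-disk size of one object per the 0.22.0 formula.

    Note that the number of cross-references is not an input.
    """
    return (object_bytes + vector_bytes + index_overhead_bytes) * desired_replication


# Hypothetical numbers: a 1000-byte object, a 2400-byte vector, 600 bytes of index overhead.
per_object = estimated_disk_bytes(1000, 2400, 600, desired_replication=3)
print(per_object)              # 12000 bytes per object
print(per_object * 1_000_000)  # 12000000000 bytes for one million such objects
```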
New Features
none
Fixes
- #967 became obsolete through this change
0.21.12 - Improved Contextionary Weighing Algorithm
Docker image/tag: semitechnologies/weaviate:0.21.12
See also: example docker compose files in English and Dutch.
Breaking Changes
none
New Features
none
Fixes
- Improved Contextionary Weighting Algorithm
This release updates the default contextionary version to ...v0.4.6, which includes an improved weighting algorithm. Prior to this release the occurrence-based weighting was done with a linear algorithm. This often led to unimportant words getting too much weight. The latest version uses a logarithmic approach. With this approach we were able to improve the accuracy of classifications done with Weaviate.
The example docker-compose files linked above have already been updated. If you're not using them, make sure to update the contextionary version accordingly in your setup.
This change is non-breaking. Keep in mind that object vectorization happens at import time. So if you want all your objects to benefit from the updated algorithm, you should reimport them.
If you aren't happy with the results and would like to use the classic linear approach, you can force the contextionary to do so by setting the environment variable OCCURRENCE_WEIGHT_STRATEGY=linear for the contextionary (!) service. It defaults to log.
0.21.11 - Entity Merging
Docker image/tag: semitechnologies/weaviate:0.21.11
See also: example docker compose files in English and Dutch.
Breaking Changes
none
New Features
-
Entity Merging (#975)
Entity merging allows you to deduplicate results. If you have several objects which describe the same physical entity, e.g. "Google Inc." and "Google Incorporated" (they both describe the real-world company "Google"), you can hide duplicates or even let Weaviate merge duplicates into a single entity.
Usage
Usage is best described in the following three example screenshots.
No grouping/merging
First up is the behavior without any grouping or merging strategy. As you can see there are a lot of duplicates:
Grouping strategy closest
With strategy closest, Weaviate tries to build groups based on your results. For each group it will show the result closest to your search query. Note that there is also a force field. The higher the force, the more likely Weaviate is going to group two objects together. A force: 1.0 would mean that every single item, no matter how different, should be grouped. A force: 0 means that only exactly identical items should be grouped. The example below uses force: 0.1 as that yielded the best results. You can see that no more company names are duplicated:
Grouping strategy merge
The example above hides duplicates. This isn't an issue if every single field is identical. But what if you need to know the original values? Strategy merge will keep the contents of the original fields. String fields contain all original values as shown below, numerical fields display a mean and reference fields contain all the references from all merged objects:
Best Practices
To get the best possible results, please keep the following things in mind:
- The grouping/merging is done internally based on vector distance. It is thus important that the items to be merged are as close to each other as possible. If your items use a lot of words which are not recognized by the contextionary, those words do not influence the vector position. In this case consider extending the contextionary using the REST API (/c11y/extensions), so that it understands more words from your object.
- You get the best possible results if noise is removed in vectorization. We thus strongly recommend setting vectorizeClassName: false and vectorizePropertyName: false for each property. Those settings were introduced in 0.21.10.
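The field-level rules of the merge strategy described under "Grouping strategy merge" can be sketched in Python. This is an illustration under stated assumptions rather than Weaviate's implementation: the function name, the sample objects and the beacon paths are hypothetical, references are modeled as lists of beacon strings, and how Weaviate actually renders the collected string values is not specified here.

```python
from statistics import mean


def merge_group(objects):
    """Merge duplicates: strings keep all values, numbers become a mean, references are combined."""
    merged = {}
    for key in {k for obj in objects for k in obj}:
        values = [obj[key] for obj in objects if key in obj]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            # Numerical fields display a mean.
            merged[key] = mean(values)
        elif all(isinstance(v, list) for v in values):
            # Reference fields contain all references from all merged objects.
            merged[key] = [ref for v in values for ref in v]
        else:
            # String fields contain all original values.
            merged[key] = values
    return merged


# Hypothetical duplicates describing the same real-world company.
companies = [
    {"name": "Google Inc.", "employees": 100, "inCity": ["weaviate://localhost/things/city-1"]},
    {"name": "Google Incorporated", "employees": 200, "inCity": ["weaviate://localhost/things/city-2"]},
]
merged = merge_group(companies)
print(merged["name"])       # ['Google Inc.', 'Google Incorporated']
print(merged["employees"])  # 150
print(merged["inCity"])     # both city beacons
```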
Fixes
none