29 Apr 12:52

084bf3f

0.22.7 - Improved Contextual Classification & Bugfixes

Docker image/tag: semitechnologies/weaviate:0.22.7
See also: example docker compose files in English, German, Dutch, Italian and Czech.

Breaking Changes

New Features

Improved Contextual classification algorithm (#1125)
Prior to this release, a contextual classification would often yield false positives for whichever label is closest to the "noise center". This means we would overweigh filler and stop words and not pay enough attention to the most important words.

As we compare a data object to its label in a contextual classification, rather than data to other data as in a knn-type classification, this issue was far more prevalent in a contextual classification than in one of type knn. In the latter, the noise would be present among all data objects, so it was likely to be cancelled out. However, in data objects with (long) texts the contextual classification suffered.

This release introduces a complete rewrite of the classification algorithm. Instead of weighing each word purely on its occurrence in the Contextionary, we now weigh (and even remove) words based on two new metrics: Information Gain and tf-idf.

Information Gain is a custom measure to predict how likely a given word is to influence the classification towards a specific target (label). For example, imagine the data object "I love my new computer" with the possible labels "Technology", "Food", "Politics". When looking at each word in the source object, Weaviate would identify "computer" as the word with the highest information gain, as it would clearly move the vector towards one of the categories ("Technology"). The other words might point to any of the categories without a clear favorite, so their information gain should be lower. As a result, Weaviate will weigh "computer" the highest in the data object.

Tf-idf, on the other hand, does not compare the data objects directly to a target (label), but rather to other objects. If multiple objects exist, such as "My new computer is great!", "Who is the new president?" and "New dishes on the menu!", the word "new" occurs in every object and thus has an Inverse Document Frequency of 0. Based on user configuration, this word can be removed from vectorization entirely.
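
To make the Inverse Document Frequency argument concrete, here is a minimal sketch that computes idf = log(N / df) for the three example objects. The tokenizer and this textbook idf variant are assumptions for illustration only; the release notes do not state which exact formula Weaviate uses.

```python
import math
import re

def inverse_document_frequencies(documents):
    """Return {word: idf} using idf = log(N / df), an assumed textbook variant."""
    # Lower-case each document and reduce it to its set of distinct words.
    tokenized = [set(re.findall(r"\w+", doc.lower())) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n_docs / sum(word in doc for doc in tokenized))
        for word in vocabulary
    }

docs = [
    "My new computer is great!",
    "Who is the new president?",
    "New dishes on the menu!",
]
idf = inverse_document_frequencies(docs)
print(idf["new"])       # 0.0, because "new" occurs in all three objects
print(idf["computer"])  # log(3), because "computer" occurs in only one object
```

A word with an idf of 0 carries no power to tell the objects apart, which is why a percentile-based cutoff can drop it before vectorization.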

The new mechanisms are user-configurable. They come with reasonable defaults that will work for many datasets, but to get the most out of your classification, it might make sense to tweak them until you get the best possible results. For a detailed list and explanation of the newly introduced parameters, see this comment.

Benchmark

In a benchmark based on the 20 news group data set we have seen a substantial improvement in success rates:

Note that this benchmark was done using a contextual classification, i.e. without training data (labeled data). The success rates are therefore not comparable to other mechanisms which rely on training data. If you want to compare Weaviate's performance with other classification mechanisms which require labeled data, please run a kNN classification instead.

Main Category

The posts were to be categorized as one of 6 categories (expected success rate for random distribution ~16.7%)

Granular Category

The posts were to be categorized as one of 20 categories (expected success rate for random distribution ~5%)

| Goal | Previous (< 0.22.7) | Improved Algorithm (>= 0.22.7) |
| --- | --- | --- |
| Main Category | 18% | 58% |
| Granular Category | 10% | 42% |

The following settings were used:
```
# dataset
n: 563 # randomly picked with a roughly equal size per category

# configuration  
type: contextual
informationGainCutoffPercentile: 10 
informationGainMaximumBoost: 3 
tfidfCutoffPercentile: 80
```

Fixes

Fix unexpected behavior on geoCoordinates 0,0 (#825)
GeoCoordinates of 0,0 - infamously known as Null Island - would lead to the geoCoordinates property disappearing entirely as 0 also happens to be the null/initial value for a property of type float in Golang. This release fixes this and we explicitly display a 0-Coordinate as such now.
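
The underlying pitfall, treating a zero as "not set", is not specific to Go. The following is a minimal Python analogue of the bug class and is not Weaviate's actual code; the function names and the `latitude`/`longitude` keys are assumptions chosen for illustration.

```python
def serialize_dropping_falsy(properties):
    """Buggy pattern: treats every falsy value, including 0.0, as unset."""
    return {key: value for key, value in properties.items() if value}

def serialize_dropping_none(properties):
    """Fixed pattern: only an explicit None counts as unset."""
    return {key: value for key, value in properties.items() if value is not None}

null_island = {"latitude": 0.0, "longitude": 0.0}
print(serialize_dropping_falsy(null_island))  # {}
print(serialize_dropping_none(null_island))   # {'latitude': 0.0, 'longitude': 0.0}
```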

06 Apr 13:06

etiennedi

0.22.6

6171437

0.22.6 - Filter objects by count of references

Docker image/tag: semitechnologies/weaviate:0.22.6
See also: example docker compose files in English, German, Dutch, Italian and Czech.

Breaking Changes

none

New Features

Filter objects by count of references (#1101)
Weaviate has already offered substantial "filter by references" capabilities in the past, such as "Find all Cities located in a Country with a population size larger than x". However, prior to this release it was not possible to filter for cases such as "Show all Cities not associated with a Country" or "Find all authors who wrote at least 2 articles".

This release adds the ability to filter by reference count. To do so, simply provide one of the existing compare operators (Equal, LessThan, LessThanEqual, GreaterThan, GreaterThanEqual) and use it directly on the reference element. For example, the following GraphQL query:
```
{
 Get {
   Things {
     Author(
       where:{
         valueInt: 2
         operator:GreaterThanEqual
         path: ["WroteArticles"]
       }
     ) {
       name
       WroteArticles {
         ... on Article {
           title
         }
       }
     }
   }
 }
}
```
Note: The example above uses the News Publication dataset.
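
For the other case mentioned above, objects without any reference, the same filter can be used with `Equal` and a count of 0. The helper below is a hypothetical client-side sketch that only builds the query string and a JSON request body; the function name is made up, and wrapping the query in a `{"query": ...}` body is the usual GraphQL-over-HTTP convention rather than something stated in these notes.

```python
import json

# Comparison operators listed in the release note above.
ALLOWED_OPERATORS = {
    "Equal", "LessThan", "LessThanEqual", "GreaterThan", "GreaterThanEqual",
}

def reference_count_query(class_name, reference_prop, operator, count, fields):
    """Build a Get query that filters class_name by its number of references."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    where = (
        f"where: {{valueInt: {int(count)}, operator: {operator}, "
        f"path: [{json.dumps(reference_prop)}]}}"
    )
    selection = " ".join(fields)
    return f"{{ Get {{ Things {{ {class_name}({where}) {{ {selection} }} }} }} }}"

# Authors that have not written any article.
query = reference_count_query("Author", "WroteArticles", "Equal", 0, ["name"])
request_body = json.dumps({"query": query})
print(query)
```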

Fixes

none

01 Apr 08:56

etiennedi

0.22.5

0ab76a1

0.22.5 - More hypertext references in API & Important Contextionary Fix

Docker image/tag: semitechnologies/weaviate:0.22.5
See also: example docker compose files in English, German, Dutch, Italian and Czech.

Breaking Changes

none

New Features

Hypertext Links on API root (#1108, #1103)
Prior to this, accessing the path / would return 404 Not Found. This was changed as follows:
- / redirects (301 Moved Permanently) to /v1, which is the API base. If the client does not automatically follow redirects, a JSON document is presented which contains the link to /v1.
- /v1 shows a list of main APIs and links to documentation for each resource group. Note this is not a complete list, as the intention is not to list every possible option (we have the swagger document for this). Instead the links work like website links, where on the root page you are presented with a few main categories.
- If the origin option is configured in the Weaviate config, an absolute URI is used. This can be helpful when Weaviate is running behind a reverse proxy (which is most likely the case in a production setting), because Weaviate then has no way of knowing how the user accesses it without it being explicitly configured. If the origin config is not set, links do not default to the listen/bind address as origin; instead relative links are presented.
Hypertext Links cross-references (#1106)
Similar to the API root links, all REST endpoints which can show cross-references now include a read-only field href alongside the existing beacon field. The field contains an HTTP Hypertext Reference to the respective resource. The same behavior regarding origin in the config and absolute vs relative URIs as outlined above applies to these links as well.
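
The absolute-versus-relative rule described above can be sketched in a few lines. This is an illustrative helper, not Weaviate's implementation; the function name, the example origin and the example path are all assumptions.

```python
def build_href(path, origin=None):
    """Return an absolute URI if an origin is configured, else a relative link."""
    if not path.startswith("/"):
        path = "/" + path
    if origin:
        # Avoid a double slash when the configured origin ends with "/".
        return origin.rstrip("/") + path
    return path

print(build_href("/v1/things/some-id", origin="https://weaviate.example.com/"))
# https://weaviate.example.com/v1/things/some-id
print(build_href("/v1/things/some-id"))
# /v1/things/some-id
```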

Fixes

Memory leak fixed in contextionary (weaviate/contextionary#25)
We discovered a potential memory leak in a library used in the contextionary. In some cases after long import sessions the contextionary memory usage would keep growing without a limit. We have replaced the code from the external library with custom code in semi-technologies#26 thus fixing the issue.

The docker-compose files linked above already reference the new version. If you are running your own setup or the K8s setup via the official helm chart, make sure you reference version <language>0.14.0-v0.4.8 or higher.

05 Mar 16:36

etiennedi

0.22.4

beb55b5

0.22.4 - New contextionary languages added

Docker image/tag: semitechnologies/weaviate:0.22.4
See also: example docker compose files in English, German, Dutch, Italian and Czech.

Breaking Changes

none

New Features

New contextionary languages added for contextionary version xx0.13.0-v0.4.7. See the links above for example docker-compose files for supported languages. You can use the linked contextionary images in other setups (Kubernetes, Helm) as well.

Fixes

none

03 Mar 08:49

etiennedi

0.22.3

3dd2739

0.22.3 - Bugfixes and Vector as part of Object's Meta

Docker image/tag: semitechnologies/weaviate:0.22.3
See also: example docker compose files in English and Dutch.

Breaking Changes

none

New Features

Return objects' vector position when meta=true (#1041)
As part of the classification feature, a meta option (passed as a query parameter) was added to the GET /v1/things and GET /v1/actions API. If the object was part of a classification, meta information about that classification is printed. Additionally, the meta object will now, regardless of classifications, also contain the object's vector position.

Keep in mind that the 600-dimensional vector is about 2.4KB of size in the underlying storage and about twice that size when encoded as float numbers in json. So you will add about 5KB of data per object when setting meta=true. While this is negligible on single objects, the additional data to be transferred on long list queries might add up to a lot of additional traffic. So, only set this option if really necessary.
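
The size estimate above can be reproduced with a little arithmetic. The sketch assumes each of the 600 components is stored as a 32-bit float (which matches the stated 2.4KB) and takes the "about twice that size" JSON factor directly from the text rather than measuring it.

```python
import struct

DIMENSIONS = 600  # vector length stated in the text above

# Pack a dummy vector as little-endian 32-bit floats (assumed storage format).
vector = [0.0] * DIMENSIONS
raw = struct.pack(f"<{DIMENSIONS}f", *vector)
raw_size = len(raw)                # 600 * 4 bytes
json_size_estimate = 2 * raw_size  # "about twice that size", per the text

def extra_traffic_bytes(object_count):
    """Rough additional payload for a list query with meta=true."""
    return object_count * json_size_estimate

print(raw_size)                  # 2400
print(extra_traffic_bytes(100))  # 480000, roughly 0.5 MB for 100 objects
```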

Fixes

Bug: ?meta=true ignored on list queries (#1099)
Prior to this release, setting the meta=true query param worked on GET /v1/things/{id} (single object), but not on GET /v1/things (list of objects). This release fixes this and makes sure meta=true can now be set on both types of GET queries.
Bug: Numbers and other characters lead to error in /c11y/concepts endpoint (#1078)
The requirements for class names and other schema fields have been loosened in the past. As of now any utf-8 letter or digit is an acceptable character. However, the /c11y/concepts endpoint, which can be used to inspect word concepts in the contextionary space, still validated against a strict [A-Za-z]. This has been changed and now all utf-8 letters and digits are acceptable.
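
The difference between the old and the new validation can be illustrated with two regular expressions. These patterns are assumptions written for this example, not the expressions used inside Weaviate; `[^\W_]` is simply one way to say "any Unicode letter or digit" in Python's `re` module.

```python
import re

STRICT_PATTERN = re.compile(r"[A-Za-z]+")  # old behaviour: ASCII letters only
RELAXED_PATTERN = re.compile(r"[^\W_]+")   # any Unicode letter or digit

def is_valid_strict(word):
    return STRICT_PATTERN.fullmatch(word) is not None

def is_valid_relaxed(word):
    return RELAXED_PATTERN.fullmatch(word) is not None

for word in ["car", "area51", "müller"]:
    print(word, is_valid_strict(word), is_valid_relaxed(word))
# car True True
# area51 False True
# müller False True
```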

28 Feb 14:43

etiennedi

0.22.2

cf0e803

0.22.2 - Parse and Normalize Phone Numbers

Docker image/tag: semitechnologies/weaviate:0.22.2
See also: example docker compose files in English and Dutch.

Breaking Changes

none

New Features

Upgrade to Go 1.14 (#1090)
No user-facing changes. Even for contributors it's very unlikely that this update introduced any changes. But we recommend updating your Go environment to the latest version if you plan on contributing to Weaviate. Thanks.
New Data Type: phoneNumber (#1088 and #1087)
A new data type with the name phoneNumber was added. This type is a primitive type like text, string, etc., as opposed to a reference type. Similar to the existing type geoCoordinates, the new type contains more than a single field.

The full type definition can be seen in the swagger.json definition

Usage

There are two user-settable sub-fields (input and defaultCountry). input must always be set when using the type; defaultCountry only needs to be set in specific situations:
- When you enter an international number (e.g. +49 171 1234567), no defaultCountry needs to be entered, as the underlying parser will recognize that the above is a German number due to the +49 prefix.
- When you enter the same number as above in a national format (e.g. 0171 1234567), you need to specify the defaultCountry (in this case: "de"), so that the parser can correctly convert the number into all formats.
Inputs and Formats
- phoneNumber.input is of type string. You can enter any phone number. Optional digits, such as an optional 0 (e.g. +49 (0) 171 ....) will be automatically recognized and normalized. Furthermore all formatting helpers, such as dashes or spaces are being removed by the parser.
- phoneNumber.defaultCountry is of type string. See "Usage" above on when this optional field is required. Content should be entered as ISO 3166-1 alpha-2 country codes.
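
As a sketch of the rule above, a client could build the property value like this before sending it. The helper is hypothetical, and using a leading "+" to detect an international number is a simplifying assumption; the actual decision is made by the parser inside Weaviate.

```python
def phone_number_value(raw_input, default_country=None):
    """Build the user-settable part of a phoneNumber property."""
    value = {"input": raw_input}
    is_international = raw_input.strip().startswith("+")
    if default_country:
        value["defaultCountry"] = default_country
    elif not is_international:
        # A national number cannot be resolved without a country hint.
        raise ValueError("national-format numbers need a defaultCountry, e.g. 'de'")
    return value

print(phone_number_value("+49 171 1234567"))
# {'input': '+49 171 1234567'}
print(phone_number_value("0171 1234567", default_country="de"))
# {'input': '0171 1234567', 'defaultCountry': 'de'}
```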
Read-only fields after parsing

When reading back a field of type phone number, the following (read-only) fields appear:
- internationalFormatted (string): Phone number in international format, e.g. "+49 171 123456"
- national (unsigned integer): National part of the phone number, e.g. 171123456
- nationalFormatted (string): Phone number in national format, e.g. "0171 123456"
- countryCode (unsigned integer): Country-code digits, e.g. 49
- valid (boolean): Whether the parser recognized the phone number as valid
- input (string): The raw phone number as put in by the user (helpful for debugging purposes), see Usage above
- defaultCountry (string): The default country as put in by the user, only set if explicitly set by the user, see Usage above
Limitations

The following phone-related features are not yet part of this release:
- Search by phone numbers (#1089)
- Aggregate phone numbers

Fixes

none

04 Feb 09:32

etiennedi

0.22.1

a6fc3a9

0.22.1 - Influence Weights in Vector Creation

Docker image/tag: semitechnologies/weaviate:0.22.1
See also: example docker compose files in English and Dutch.

Breaking Changes

none

New Features

Override weights on vector creation (#1070 and #1074)
Prior to this release the weight of each individual word when creating a vector from an object was out of the user's control. The contextionary uses an algorithm based on the general occurrence of the word in its training data, to suggest how each word should be weighted. The underlying assumption is that a rare word should take more precedence over a very common word, similar to tf-idf.

This works well in most cases, but in some use-case specific domain languages common words get a new meaning and therefore their importance should change. Imagine the words "far" and "near". They are quite common in overall language, so - especially when mixed with rarer words - they wouldn't get a great weight. However, now assume you're in the domain of optometry or manufacturing glasses. In the terms "far-sighted" and "near-sighted", the words "near" and "far" make a very important distinction. Imagine you were trying to classify objects based on those terms. With the changes in 0.22.1 you can now influence - or even completely override - the weights of individual words when creating vectors.

To do so, the field vectorWeights was introduced to the Thing and Action objects. The field is a key-value map where both the keys and the values must be strings. The keys are the words you want to influence and the value is a mathematical expression to set the new weight. You can use additions, subtractions, multiplications, divisions or simply overwrite the weight with a fixed number. To reference the original weights, use the single-letter variable w. Some examples:
- "vectorWeights": {"far": "10 * w"}
  Give the word "far" 10 times its original weight
- "vectorWeights": {"far": "w + 0.5", "near": "w - 0.5"}
  Give the word "far" an absolute boost of 0.5, while penalizing the word "near" by 0.5.
- "vectorWeights": {"sighted": "0.7", "glasses": "2 - 4 * w"}
  Let the word "sighted" have a fixed weight of 0.7 whereas the word "glasses" is calculated by subtracting 4 times the original weight from the number 2.
Some important things to note:
- For this feature to work you need a contextionary version of at least ...v0.4.7. The example docker-compose files linked above have already been updated to the required version.
- Spaces in math expressions have no meaning.
- A word that is not referenced in "vectorWeights" will simply use its original weight as returned by the contextionary.
- Custom vectorWeights only affect the object which they are set on, there is no option to globally manipulate a specific word. If the same vectorWeights are required for multiple objects, simply attach them to all objects where needed.
- Whenever the mathematical expression is not a fixed number (such as "17") an operator must be present. It is not valid to use implicit operators, such as "2w" which would mean "two times the original weight". In this case explicitly use the multiplication operator, e.g. "2 * w" or "w*2".
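
To illustrate how such expressions behave, here is a small stand-alone evaluator that follows the rules listed above (the four basic operators, the variable w, fixed numbers, no implicit multiplication). It is a sketch built on Python's `ast` module and is not the parser used by Weaviate or the contextionary; the function names are assumptions.

```python
import ast
import operator

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _evaluate(node, w):
    """Evaluate numbers, the variable w and + - * / only; reject the rest."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return float(node.value)
    if isinstance(node, ast.Name) and node.id == "w":
        return w
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        left = _evaluate(node.left, w)
        right = _evaluate(node.right, w)
        return _OPERATORS[type(node.op)](left, right)
    raise ValueError("unsupported element in weight expression")

def apply_weight(expression, w):
    """Return the new weight for a word whose original weight is w."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
    except SyntaxError:
        # Implicit operators such as "2w" end up here.
        raise ValueError(f"invalid weight expression: {expression!r}")
    return _evaluate(tree.body, w)

original = 0.5
print(apply_weight("10 * w", original))     # 5.0
print(apply_weight("w + 0.5", original))    # 1.0
print(apply_weight("0.7", original))        # 0.7
print(apply_weight("2 - 4 * w", original))  # 0.0
```

Words that are not listed in vectorWeights would simply keep `original`, so a caller would only invoke `apply_weight` for keys that are present in the map.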
Full example

Here's a full example for importing a thing object

POST /v1/things
```
{
 "class": "Glasses",
 "schema": {
   "description": "These glasses are meant for far-sighted people"
 },
 "vectorWeights": {
   "far": "5 * w",
   "near": "5 * w"
 }
}
```
The above example will boost the words "far" or "near" by a factor of 5. Note that the object does not contain the word "near", so only the word "far" is boosted. The other unreferenced words maintain their original weights.

Fixes

none

23 Jan 16:20

etiennedi

0.22.0

19dc806

0.22.0 - Updated Cross-Reference Storage Strategy

Docker image/tag: semitechnologies/weaviate:0.22.0
See also: example docker compose files in English and Dutch.

Contains Breaking Change!

Note: While this release contains no API-level breaking changes, the internals have changed so much that we recommend not simply replacing your existing Weaviate container with the new one. Instead you should create a new cluster and reimport your things and actions. See the changelog below for more detailed reasons why.

Breaking Changes

Improve cross-reference storing strategy (#1069)
Prior to this release Weaviate would build an automated cache of referenced objects. This led to very fast response time for nested queries, at the cost of large disk usage. We have since learned that disk usage can be so excessive in heavily connected graphs that the benefits don't outweigh the costs. In addition configuring cache boundaries led to unnecessary complexity.

The major goal of 0.22.0 was to replace automated denormalization caching with a smarter strategy without losing the snappiness of cached results and the overall low latencies of queries our users have come to appreciate.

We believe we have found a good strategy with this release, by implementing smarter query strategies to keep inter-container traffic to a minimum and use our backing storage in a way it performs well.

This boils down to the following advantages that 0.22.0 provides over 0.21.x:
- Feature parity: No feature got lost through the rewrite. If it worked with 0.21.x, it works with 0.22.x. If you think otherwise, please open an issue.
- Much smaller disk footprint: Since we don't excessively denormalize references anymore, the disk footprint got much smaller. Essentially the size on disk is now (object size + vector size + index overheads) * desiredReplication. The number of cross-references no longer has a direct impact on disk space (other than storing the link itself, which is effectively the size of the bytes in a weaviate://... beacon).
- No depth limit on nested filters: Prior to this release a filter on a cross-ref prop, such as path: ["inCity", "City", "inCountry", "Country", "name"], had a limit. It would only work within a cache boundary. This limitation is now gone and you can filter as deep as you like. Please note that an excessively deep query will have a performance impact.
- Smaller CPU impact during imports: Prior to this release we'd spend a share of the available resources on building a denormalized cache asynchronously after importing a connected object. Without having to build such a cache, more performance on imports is available for storing, vectorizing and indexing objects.
Please note that caching was previously done at import time. We recommend not to try to upgrade a 0.21.x cluster, but instead creating a new cluster and reimporting. This is the only way to guarantee your cluster won't have cache leftovers which can impact performance.
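
The disk footprint formula from the list above can be turned into a quick estimate. All input numbers below are made-up example values; only the formula itself and the 2.4KB vector size (mentioned in the 0.22.3 notes) come from this changelog.

```python
def estimated_disk_bytes(object_bytes, vector_bytes, index_overhead_bytes,
                         desired_replication):
    """(object size + vector size + index overheads) * desiredReplication."""
    return (object_bytes + vector_bytes + index_overhead_bytes) * desired_replication

# Hypothetical object: 1000 bytes of properties, a 2400-byte vector,
# 500 bytes of index overhead, replicated 3 times.
per_object = estimated_disk_bytes(1000, 2400, 500, 3)
print(per_object)              # 11700
print(per_object * 1_000_000)  # 11700000000, about 11.7 GB for one million objects
```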

New Features

none

Fixes

#967 became obsolete through this change

17 Jan 16:24

etiennedi

0.21.12

a5b0914

0.21.12 - Improved Contextionary Weighing Algorithm

Docker image/tag: semitechnologies/weaviate:0.21.12
See also: example docker compose files in English and Dutch.

Breaking Changes

none

New Features

none

Fixes

Improved Contextionary Weighting Algorithm
This release updates the default contextionary version to ...v0.4.6, which includes an improved weighting algorithm. Prior to this release the occurrence-based weighting was done with a linear algorithm. This often led to unimportant words getting too much weight. The latest version uses a logarithmic approach. With this approach we were able to improve the accuracy of classifications done with Weaviate.

The example docker-compose files linked above have already been updated. If you're not using them, make sure to update the contextionary version accordingly in your setup.

This change is non-breaking. Keep in mind that object vectorization happens at import time. So if you want all your objects to benefit from the updated algorithm, you should reimport them.

If you aren't happy with the results and would like to use the classic linear approach, you can force the contextionary to do so, by setting the environment variable OCCURRENCE_WEIGHT_STRATEGY=linear for the contextionary (!) service. It defaults to log.
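
To give an intuition for why a logarithmic scale helps, the sketch below compares two toy weighting functions. Both formulas are assumptions invented for this illustration and are not the contextionary's actual formulas; they only show how a linear scale barely separates a fairly common word from a rare one, while a logarithmic scale does.

```python
import math

MAX_OCCURRENCE = 1_000_000  # hypothetical count of the most frequent word

def linear_weight(occurrence):
    """Toy linear scheme: the weight falls in proportion to the raw count."""
    return 1 - occurrence / MAX_OCCURRENCE

def log_weight(occurrence):
    """Toy logarithmic scheme: the weight falls with the magnitude of the count."""
    return 1 - math.log(occurrence) / math.log(MAX_OCCURRENCE)

for label, occurrence in [("rare word", 100), ("common word", 100_000)]:
    print(label, round(linear_weight(occurrence), 3), round(log_weight(occurrence), 3))
```

With these toy numbers the linear scheme gives the common word almost the same weight as the rare one, whereas the logarithmic scheme pushes it much lower.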

16 Jan 16:43

etiennedi

0.21.11

41b43e2

0.21.11 - Entity Merging

Docker image/tag: semitechnologies/weaviate:0.21.11
See also: example docker compose files in English and Dutch.

Breaking Changes

none

New Features

Entity Merging (#975)
Entity merging allows you to deduplicate results. If you have several objects which describe the same physical entity, e.g. "Google Inc." and "Google Incorporated" (they both describe the real-world company "Google"), you can hide duplicates or even let Weaviate merge duplicates into a single entity.

Usage

Usage is best described in the following three example screenshots.

No grouping/merging
First up is the behavior without any grouping or merging strategy. As you can see there are a lot of duplicates:

Grouping strategy closest
With strategy closest, Weaviate tries to build groups based on your results. For each group it will show the results closest to your search query. Note that there is also a force field. The higher the force, the more likely Weaviate is to group two objects together. A force: 1.0 would mean that every single item, no matter how different, should be grouped. A force: 0 means that only exactly identical items should be grouped. The example below uses force: 0.1 as that yielded the best results. You can see that no more company names are duplicated:

Grouping strategy merge
The example above hides duplicates. This isn't an issue if every single field is identical. But what if you need to know the original values? Strategy merge will keep the contents of the original fields. String fields contain all original values as shown below, numerical fields display a mean and reference fields contain all the references from all merged objects:
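
The merge rules described above (keep all string values, average numbers, collect all references) can be sketched as follows. This is an illustrative re-implementation on plain dictionaries, not Weaviate's code; representing references as lists and the example property names and values are assumptions.

```python
from statistics import mean

def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def merge_objects(objects):
    """Merge duplicates: mean for numbers, union for references, all values otherwise."""
    keys = []
    for obj in objects:
        for key in obj:
            if key not in keys:
                keys.append(key)
    merged = {}
    for key in keys:
        values = [obj[key] for obj in objects if key in obj]
        if all(_is_number(value) for value in values):
            merged[key] = mean(values)
        elif all(isinstance(value, list) for value in values):
            # Reference fields: keep every distinct reference once.
            refs = []
            for value in values:
                refs.extend(ref for ref in value if ref not in refs)
            merged[key] = refs
        else:
            # String fields: keep every distinct original value.
            merged[key] = list(dict.fromkeys(values))
    return merged

duplicates = [
    {"name": "Google Inc.", "employees": 100, "inCity": ["city-a"]},
    {"name": "Google Incorporated", "employees": 200, "inCity": ["city-a", "city-b"]},
]
print(merge_objects(duplicates))
```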

Best Practices

To get the best possible results, please keep the following things in mind:
- The grouping/merging is done internally based on vector distance. It is thus important that the items to be merged are as close to each other as possible. If your items use a lot of words which are not recognized by the contextionary, those words do not influence the vector position. In this case consider extending the contextionary using the REST API (/c11y/extensions), so that it understands more words from your object.
- You get the best possible results if noise is removed in vectorization. We thus strongly recommend setting vectorizeClassName: false and vectorizePropertyName: false for each property. Those settings were introduced in 0.21.10.

Fixes

none

Releases: weaviate/weaviate
