
v1.8.0 - Horizontal Scalability, Sharding, Pagination, Important Fixes

@etiennedi released this 30 Nov 17:51

Migration Notice

Version v1.8.0 introduces multi-shard indices and horizontal scaling. As a result, the dataset needs to be migrated. This migration is performed automatically - without user interaction - when first starting up with Weaviate v1.8.0. However, it cannot be reversed. We therefore recommend carefully reading the following migration notes and making a case-by-case decision about the best upgrade path for your needs.

Why is a data migration necessary?

Prior to v1.8.0, Weaviate did not support multi-shard indices. However, since the feature was already planned, data was already contained in a single shard with a fixed name. A migration is necessary to move the data from this single fixed shard into a multi-shard setup. The number of shards does not change. When you first run v1.8.0 on an existing dataset, the following steps happen automatically:

  • Weaviate discovers the missing sharding configuration for your classes and fills it with the default values
  • When shards start-up and they do not exist on disk, but a shard with a fixed
    name from v1.7.x exists, Weaviate automatically recognizes that a migration
    is necessary and moves the data on disk
  • Once Weaviate is up and running, the data has been migrated.

Important Notice: As part of the migration, Weaviate will assign the shard to the (only) node available in the cluster. You need to make sure that this node has a stable hostname. If you run on Kubernetes, hostnames are stable (e.g. weaviate-0 for the first node). With docker-compose, however, hostnames default to the ID of the container. If you remove your containers (e.g. via docker-compose down) and start them up again, the hostname will have changed. This leads to errors where Weaviate reports that it cannot find the node a shard belongs to. The node sending the error message is the very node that owns the shard, but it cannot recognize this, since its own name has changed.

To remedy this, set a stable hostname before first starting up with v1.8.0 by setting the env var CLUSTER_HOSTNAME=node1. The actual name does not matter, as long as it is stable.

If you forgot to set a stable hostname and are now running into the error mentioned above, you can still explicitly set the hostname that was used before, which you can derive from the error message.

Example:

If you see the error message "shard Knuw6a360eCY: resolve node name \"5b6030dbf9ea\" to host", you can make Weaviate usable again by setting 5b6030dbf9ea as the hostname: CLUSTER_HOSTNAME=5b6030dbf9ea.
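For docker-compose users, here is a minimal sketch of pinning the hostname in the compose file; the service layout and image tag are illustrative, not prescribed by this release:

```yaml
# docker-compose.yml (sketch; service layout and image tag are illustrative)
services:
  weaviate:
    image: semitechnologies/weaviate:1.8.0
    environment:
      # Pin the node name so that removing and recreating the container
      # does not change the hostname the shard assignment points to.
      CLUSTER_HOSTNAME: 'node1'
```

With this in place, docker-compose down followed by a fresh start no longer changes the node name that Weaviate registered the shard under.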

Should you upgrade or reimport?

Please note that besides new features, v1.8.0 also contains a large collection of bug fixes. Some of those bugs affected how the HNSW index was written to disk, so it cannot be ruled out that a migrated index on disk is of subpar quality compared to a freshly built index in v1.8.0. Therefore, if you can reimport using a script or similar, we generally recommend starting with a fresh v1.8.0 setup and reimporting instead of migrating.

Is downgrading possible after upgrading?

Note that the data migration which happens at the first startup of v1.8.0 is not automatically reversible. If you plan on downgrading to v1.7.x again after upgrading, you must explicitly create a backup of the state prior to upgrading.

Changelog

Breaking Changes

None; however, see the migration notice above.

New Features

  • Horizontal Scalability (#1599, #1600, #1601, #1623, #1622, #1653, #1654, #1655, #1658, #1672, #1667, #1679, #1695)

    The big one! Too big for a small release notes page. Instead, we have written extensive documentation on all things Horizontal Scalability; please see the official documentation.

  • Improvements for Filtered Vector Search (#1728, #1732)

    See the published benchmarks for details. The improvements consist of three parts; see the PRs linked above.

  • Pagination #1627

    Starting with this release, search results can be paged. This feature is available on:

    • List requests (GET /v1/objects and GraphQL Get { Class { } })
    • Vector Searches (GraphQL near<Media>) and Filter Searches (GraphQL where: {})

    Usage

    To use pagination, one new parameter is introduced (offset), which works in conjunction with the existing limit parameter. For example, to list the first ten results, set limit: 10. Then, to "display the second page of 10", set offset: 10, limit: 10, and so on; e.g. to show the 9th page of 10 results, set offset: 80, limit: 10 to effectively display results 81-90.

    • To do so via REST, append the two parameters as URL params, e.g. GET /v1/objects?limit=25&offset=75
    • To do so in GraphQL, add the two parameters to the class, e.g. { Get { MyClassName(limit: 25, offset: 75) { ... } } }
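For illustration, a minimal sketch of both calls in Python using the requests library; the base URL, the class name MyClassName, and the title property are assumptions for the example, not part of this release:

```python
import requests

BASE = "http://localhost:8080"  # assumed local Weaviate instance

# REST: fetch results 76-100 of the plain object listing
resp = requests.get(f"{BASE}/v1/objects", params={"limit": 25, "offset": 75})
print(resp.json())

# GraphQL: the same paging parameters on a Get query.
# MyClassName and the "title" property are placeholders.
query = """
{
  Get {
    MyClassName(limit: 25, offset: 75) {
      title
    }
  }
}
"""
resp = requests.post(f"{BASE}/v1/graphql", json={"query": query})
print(resp.json())
```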

    Performance and Resource Considerations & Limitations

    The pagination implementation is offset-based, not cursor-based. This has the following implications:

    • The cost of retrieving each further page is higher than that of the previous one. Effectively, when asked for search results 91-100, Weaviate will internally retrieve 100 search results and discard the first 90 before serving them to the user. This effect is amplified in a multi-shard setup, where each shard retrieves 100 results, after which the results are aggregated and ultimately cut off. So in a 10-shard setup, asking for results 91-100 means Weaviate effectively retrieves 1,000 results (100 per shard) and discards 990 of them before serving. High page numbers therefore lead to longer response times and more load on the machine/cluster.
    • Due to this increasing per-page cost, there is a limit to how many objects can be retrieved using pagination. By default, setting the sum of offset and limit to more than 10,000 will lead to an error. If you must retrieve more than 10,000 objects, you can increase this limit by setting the environment variable QUERY_MAXIMUM_RESULTS=<desired-value>. Warning: Setting this to an arbitrarily high value can make the memory consumption of a single query explode, and single queries can slow down the entire cluster. We recommend setting this value to the lowest possible value that does not interfere with your users' expectations.
    • Pagination is not stateful. If the database state has changed between retrieving two pages, there is no guarantee that your pages cover all results. If no writes happened in between, however, pagination can be used to retrieve all results within the configured maximum: asking for a single page of 10,000 objects will lead to the same overall results as asking for 100 pages of 100 objects. A sketch of such an exhaustive scan follows below.
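For completeness, a sketch of paging through an entire listing under these constraints, again in Python with requests; it assumes a local instance and that the REST list response carries its hits under an objects key:

```python
import requests

BASE = "http://localhost:8080"      # assumed local Weaviate instance
PAGE_SIZE = 100
MAX_RESULTS = 10_000                # default cap (QUERY_MAXIMUM_RESULTS)

objects = []
offset = 0
# Walk the listing page by page; stop at a short page or at the cap.
# Because pagination is not stateful, this only covers every object
# if no writes happen while the scan is running.
while offset + PAGE_SIZE <= MAX_RESULTS:
    resp = requests.get(f"{BASE}/v1/objects",
                        params={"limit": PAGE_SIZE, "offset": offset})
    page = resp.json().get("objects", [])
    objects.extend(page)
    if len(page) < PAGE_SIZE:       # short page means we reached the end
        break
    offset += PAGE_SIZE

print(f"retrieved {len(objects)} objects")
```

Because the per-page cost grows as described above, an exhaustive scan like this should be reserved for cases where a full export is genuinely needed.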

Fixes

  • General Performance Improvements around Memory Allocations (#1620)

    Thanks to @cdpierse for his contributions to this issue
  • Fix behavior that could lead to a crashloop after an unexpected shutdown or crash:

    • Crashloops after unexpected shutdowns #1697 #1698 #1703
    • HNSW integrity compromised after restarts #1701 #1705
    • Improve ingesting WAL at crash recovery startup #1713
    • Fix an issue where parsing the WAL would lead to the creation of another WAL, thus increasing the effort for recovery if it were to fail again. #1716
    • Fix an issue where a failure during memtable flush may have led to an unparsable disk segment #1725
    • Ignore zero-length disk segment files. Previously they could block startup. #1726
  • Fix panic on filters #1750

    Fixes an issue where invalid combinations of prop types and filter types could lead to panics
  • Other fixes

    • Filter by ID (introduced in 1.7.2) #1708
    • Use Feature Projection in text2vec-transformers module #1572