<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

# Read and write operations

This notebook documents ways to feed, get, update and delete data:

- Using a context manager (`with`) to manage resources efficiently
- Feeding streams of data using `feed_iterable`, which can feed from streams, Iterables, Lists and files through generators


<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    for any problem when running this guide.
</div>


## Deploy a sample application

[Install pyvespa](https://pyvespa.readthedocs.io/) and start Docker. Validate that at least 4 GB of memory is available:


In [1]:
!docker info | grep "Total Memory"

Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.


Define a simple application package with five fields:


In [2]:
from vespa.application import ApplicationPackage
from vespa.package import Schema, Document, Field, FieldSet, HNSW, RankProfile

app_package = ApplicationPackage(
    name="vector",
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["attribute", "summary"]),
                    Field(
                        name="title",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="body",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="popularity",
                        type="float",
                        indexing=["attribute", "summary"],
                    ),
                    Field(
                        name="embedding",
                        type="tensor<bfloat16>(x[1536])",
                        indexing=["attribute", "summary", "index"],
                        ann=HNSW(
                            distance_metric="innerproduct",
                            max_links_per_node=16,
                            neighbors_to_explore_at_insert=128,
                        ),
                    ),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["title", "body"])],
            rank_profiles=[
                RankProfile(
                    name="default",
                    inputs=[("query(q)", "tensor<float>(x[1536])")],
                    first_phase="closeness(field, embedding)",
                )
            ],
        )
    ],
)



In [3]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.


## Feed data by streaming over Iterable type

This example notebook uses the [dbpedia-entities-openai-1M](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M)
dataset (1M OpenAI Embeddings (1536 dimensions) from June 2023).

The `streaming=True` option allows paging the data on demand from HF S3.
This is very useful for large datasets that do not fit in memory,
or when downloading the entire dataset is not needed.
Read more about [datasets stream](https://huggingface.co/docs/datasets/stream).


In [4]:
from datasets import load_dataset

dataset = load_dataset(
    "KShivendu/dbpedia-entities-openai-1M", split="train", streaming=True
).take(10000)

Resolving data files:   0%|          | 0/26 [00:00<?, ?it/s]

### Converting dataset field names to Vespa schema field names

We need to convert the dataset field names to the configured Vespa schema field names. We do this with a simple lambda function.

The `map` call does not fetch any data by itself; the mapping is performed lazily, once we start iterating over the dataset.
This allows chaining map operations, where each lambda yields the next document.
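The lazy behaviour can be illustrated without Hugging Face at all. The following is a minimal sketch using only the standard library: `source_rows` is a hypothetical stand-in for the streamed dataset, and `islice` plays the role of `.take(n)`. It is meant to show the idea, not how `datasets` implements it.

```python
from itertools import islice
from typing import Iterator

calls = []  # records which source rows were actually read


def source_rows() -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset; yields rows on demand."""
    for i in range(1_000_000):
        calls.append(i)
        yield {"_id": f"doc-{i}", "openai": [0.0, 1.0]}


def to_feed_format(row: dict) -> dict:
    """Rename dataset keys to the Vespa schema field names."""
    return {"id": row["_id"], "fields": {"id": row["_id"], "embedding": row["openai"]}}


# Building the pipeline reads nothing: map() and islice() are both lazy.
pipeline = map(to_feed_format, islice(source_rows(), 3))
rows_read_before = len(calls)

# Iterating pulls rows through the chain one at a time.
docs = list(pipeline)
rows_read_after = len(calls)

print(rows_read_before, rows_read_after, docs[0]["id"])
```

Only the three requested rows are ever produced, even though the source could yield a million.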


In [5]:
pyvespa_feed_format = dataset.map(
    lambda x: {"id": x["_id"], "fields": {"id": x["_id"], "embedding": x["openai"]}}
)

Feed using [feed_iterable](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa.feed_iterable), which
accepts an `Iterable`. `feed_iterable` also accepts a callback that is called for every single data operation, so we can
check the result. If the result `is_successful()`, the operation is persisted and applied in Vespa.


In [6]:
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )
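Printing is fine for a demo, but for larger feeds it can be handy to collect failures and inspect them afterwards. Below is a minimal sketch of that variation. `FakeResponse` is a hypothetical stand-in that only mimics the three members the callback touches (`is_successful()`, `status_code`, `get_json()`), so the example runs without a Vespa instance; the status codes and message are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing the parts of VespaResponse the callback uses."""

    status_code: int
    body: dict = field(default_factory=dict)

    def is_successful(self) -> bool:
        return self.status_code == 200

    def get_json(self) -> dict:
        return self.body


failures: List[Tuple[str, int, dict]] = []


def collecting_callback(response, id: str) -> None:
    """Record failed operations instead of printing them."""
    if not response.is_successful():
        failures.append((id, response.status_code, response.get_json()))


# Simulate one successful and one failed operation.
collecting_callback(FakeResponse(200), "a")
collecting_callback(FakeResponse(507, {"message": "disk full"}), "b")
print(failures)
```

A callback with the same two-argument shape could be passed as `callback=` in place of the printing version.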

In [7]:
app.feed_iterable(
    iter=pyvespa_feed_format,
    schema="doc",
    namespace="benchmark",
    callback=callback,
    max_queue_size=4000,
    max_workers=16,
    max_connections=16,
)

Preparing feed requests...


10000 Requests [00:40, 246.21 Requests/s]

Requests completed






### Feeding with generators

The example above streamed data from a remote repository. We can also use generators or a plain List. In this example, we generate synthetic data
using a generator function.


In [8]:
from typing import Iterator


def my_generator() -> Iterator[dict]:
    for i in range(1000):
        yield {
            "id": str(i),
            "fields": {
                "id": str(i),
                "title": "title",
                "body": "this is body",
                "popularity": 1.0,
            },
        }
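One detail worth keeping in mind when feeding from generators: a generator object can be consumed only once, so each feed call needs a fresh one. The sketch below shows this with a smaller, hypothetical `small_generator`; it is plain Python behaviour and does not involve pyvespa.

```python
from typing import Iterator


def small_generator() -> Iterator[dict]:
    for i in range(3):
        yield {"id": str(i), "fields": {"id": str(i)}}


gen = small_generator()
first_pass = [doc["id"] for doc in gen]
second_pass = [doc["id"] for doc in gen]  # the generator is already exhausted

fresh_pass = [doc["id"] for doc in small_generator()]  # a new call starts over

print(first_pass, second_pass, fresh_pass)
```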

In [9]:
app.feed_iterable(
    iter=my_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    max_queue_size=4000,
    max_workers=12,
    max_connections=12,
)

Preparing feed requests...


1000 Requests [00:01, 959.21 Requests/s]

Requests completed






### Updates

Using a similar generator, we can update the synthetic data we added. This performs
[partial updates](https://docs.vespa.ai/en/partial-updates.html), assigning the value `2.0` to the `popularity` field.

Note that pyvespa only supports the `assign` type of [partial updates](https://docs.vespa.ai/en/reference/document-json-format.html#update)
and will automatically rewrite an update operation with fields like this:

```
"fields": {
    "title": "The best of Bob Dylan"
}
```

to the JSON update syntax expected by Vespa:

```
"fields": {
    "title": {
        "assign": "The best of Bob Dylan"
    }
}
```
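The shape of that rewrite can be sketched in a few lines. The helper name `to_assign_update` is hypothetical and this is not pyvespa's internal code; it only illustrates wrapping each field value in an `assign` object.

```python
def to_assign_update(fields: dict) -> dict:
    """Wrap every field value in an {"assign": value} object."""
    return {name: {"assign": value} for name, value in fields.items()}


update = to_assign_update({"title": "The best of Bob Dylan"})
print(update)
```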


In [10]:
from typing import Iterator


def my_update_generator() -> Iterator[dict]:
    for i in range(1000):
        yield {"id": str(i), "fields": {"popularity": 2.0}}

In [11]:
responses = app.feed_iterable(
    iter=my_update_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    operation_type="update",
    max_queue_size=4000,
    max_workers=12,
    max_connections=12,
)

Preparing update requests...


1000 Requests [00:01, 954.70 Requests/s]

Requests completed






We can now query the data. Notice how we use a context manager (`with`) to close the connection after the query.
This avoids resource leakage and allows for reuse of connections. In this case, we only run a single
query, so there is no need for more than one connection; setting more connections would just
increase connection-level overhead.


In [12]:
from vespa.io import VespaQueryResponse

with app.syncio(connections=1):
    response: VespaQueryResponse = app.query(
        yql="select id from doc where popularity > 1.5", hits=0
    )
    print(response.number_documents_retrieved)

1000
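For readers less familiar with context managers, here is a minimal sketch of the pattern itself. `FakePool` and `open_pool` are hypothetical names used only for illustration; they are not part of pyvespa and do not open real connections.

```python
from contextlib import contextmanager
from typing import Iterator


class FakePool:
    """Hypothetical stand-in for a pool of HTTP connections."""

    def __init__(self, connections: int) -> None:
        self.connections = connections
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def open_pool(connections: int = 1) -> Iterator[FakePool]:
    pool = FakePool(connections)
    try:
        yield pool  # the body of the `with` block runs here
    finally:
        pool.close()  # always runs, even if the body raises


with open_pool(connections=1) as pool:
    open_inside = not pool.closed

# The pool is also closed when the body fails.
try:
    with open_pool() as failing_pool:
        raise RuntimeError("query failed")
except RuntimeError:
    pass

print(open_inside, pool.closed, failing_pool.closed)
```

The `finally` clause is what guarantees cleanup, which is the property the `with` block relies on.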


### Deleting

Delete all the synthetic data with a custom generator. Now we don't need the `fields` key.


In [13]:
from typing import Iterator


def my_delete_generator() -> Iterator[dict]:
    for i in range(1000):
        yield {"id": str(i)}


responses = app.feed_iterable(
    iter=my_delete_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    operation_type="delete",
    max_queue_size=5000,
    max_workers=12,
    max_connections=12,
)

Preparing delete requests...


1000 Requests [00:00, 1066.23 Requests/s]

Requests completed






## Feeding operations from a file

This demonstrates how we can use `feed_iterable` to feed from a large file without reading the entire file into memory. This also
uses a generator.

First we dump some operations to a file:


In [14]:
# Dump some operations to a JSONL file, stored in the format expected by pyvespa.
# This is to demonstrate feeding from a file in the next section.
import json

with open("documents.jsonl", "w") as f:
    for doc in dataset:
        d = {"id": doc["_id"], "fields": {"id": doc["_id"], "embedding": doc["openai"]}}
        f.write(json.dumps(d) + "\n")
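To inspect what such a file looks like, one could peek at its first line without loading the rest. The sketch below is self-contained: it writes two made-up operations to a temporary file (the path and sample values are assumptions for illustration) and reads back only the first one.

```python
import json
import os
import tempfile

# Write two tiny operations to a temporary JSONL file (made-up sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for i in range(2):
        op = {"id": str(i), "fields": {"id": str(i), "embedding": [0.1, 0.2]}}
        f.write(json.dumps(op) + "\n")

# Read only the first line; the rest of the file is never loaded.
with open(path) as f:
    first = json.loads(f.readline())

print(first["id"], list(first["fields"]))
```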

Define the file generator that will yield one line at a time:


In [15]:
import json
from typing import Iterator


def from_file_generator() -> Iterator[dict]:
    with open("documents.jsonl") as f:
        for line in f:
            yield json.loads(line)

In [16]:
responses = app.feed_iterable(
    iter=from_file_generator(),
    schema="doc",
    namespace="benchmark",
    callback=callback,
    operation_type="feed",
    max_queue_size=4000,
    max_workers=12,
    max_connections=12,
)

Preparing feed requests...


10000 Requests [00:17, 560.54 Requests/s]

Requests completed






### Get and Feed individual data points

Feed a single data point to Vespa:


In [17]:
with app.syncio(connections=1):
    response: VespaResponse = app.feed_data_point(
        schema="doc",
        namespace="benchmark",
        data_id="1",
        fields={
            "id": "1",
            "title": "title",
            "body": "this is body",
            "popularity": 1.0,
        },
    )
    print(response.is_successful())
    print(response.get_json())

True
{'pathId': '/document/v1/benchmark/doc/docid/1', 'id': 'id:benchmark:doc::1'}
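The response shows how the arguments map onto the document id and the `/document/v1` path. As a rough sketch, assuming the simple case with no extra id modifiers, the two strings can be composed like this. The helper names `document_id` and `document_path` are hypothetical, and URL-quoting the id in the path is an assumption for ids with special characters.

```python
from urllib.parse import quote


def document_id(namespace: str, schema: str, data_id: str) -> str:
    """Compose the full document id, e.g. id:benchmark:doc::1."""
    return f"id:{namespace}:{schema}::{data_id}"


def document_path(namespace: str, schema: str, data_id: str) -> str:
    """Compose the /document/v1 path; the id is URL-quoted (assumption)."""
    return f"/document/v1/{namespace}/{schema}/docid/{quote(data_id, safe='')}"


doc_id = document_id("benchmark", "doc", "1")
path_id = document_path("benchmark", "doc", "1")
print(doc_id, path_id)
```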


Get the same document. Also try changing `data_id` to a document that does not exist, which results in a 404 HTTP error.


In [18]:
with app.syncio(connections=1):
    response: VespaResponse = app.get_data(
        schema="doc",
        namespace="benchmark",
        data_id="1",
    )
    print(response.is_successful())
    print(response.get_json())

True
{'pathId': '/document/v1/benchmark/doc/docid/1', 'id': 'id:benchmark:doc::1', 'fields': {'body': 'this is body', 'title': 'title', 'popularity': 1.0, 'id': '1'}}


### Upsert

The following sends an update operation with `create=True`. If the document exists, the `popularity` field is updated to the value 3.0;
if the document does not exist, it is created with the `popularity` value 3.0.


In [19]:
with app.syncio(connections=1):
    response: VespaResponse = app.update_data(
        schema="doc",
        namespace="benchmark",
        data_id="does-not-exist",
        fields={"popularity": 3.0},
        create=True,
    )
    print(response.is_successful())
    print(response.get_json())

True
{'pathId': '/document/v1/benchmark/doc/docid/does-not-exist', 'id': 'id:benchmark:doc::does-not-exist'}


In [20]:
with app.syncio(connections=1):
    response: VespaResponse = app.get_data(
        schema="doc",
        namespace="benchmark",
        data_id="does-not-exist",
    )
    print(response.is_successful())
    print(response.get_json())

True
{'pathId': '/document/v1/benchmark/doc/docid/does-not-exist', 'id': 'id:benchmark:doc::does-not-exist', 'fields': {'popularity': 3.0}}
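The upsert semantics can be pictured with a toy in-memory model. This is only an illustration of the behaviour described above, not how Vespa implements it: `store` and `apply_update` are hypothetical names, and the return value simply reports whether anything was written in this model.

```python
from typing import Dict

store: Dict[str, dict] = {}  # toy document store keyed by document id


def apply_update(doc_id: str, fields: dict, create: bool = False) -> bool:
    """Assign fields on a document; with create=True, make it first if missing."""
    if doc_id not in store:
        if not create:
            return False  # nothing to update in this toy model
        store[doc_id] = {}
    store[doc_id].update(fields)
    return True


applied_plain = apply_update("does-not-exist", {"popularity": 3.0})
applied_upsert = apply_update("does-not-exist", {"popularity": 3.0}, create=True)
print(applied_plain, applied_upsert, store)
```

As in the output above, the created document contains only the field that the update assigned.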


## Cleanup


In [21]:
vespa_docker.container.stop()
vespa_docker.container.remove()

## Next steps

Read more on writing to Vespa in [reads-and-writes](https://docs.vespa.ai/en/reads-and-writes.html).
