<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

# Feeding performance

This exploratory notebook compares different ways of feeding documents to Vespa.
We will look at these methods:

1. Using `feed_iterable()`.
2. Using `feed_iterable_async()`.
3. Using the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html).


<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    for any problem when running this guide.
</div>


[Install pyvespa](https://pyvespa.readthedocs.io/) and start the Docker daemon. Validate that at least 6 GB of memory is available to Docker:


In [1]:
#!pip3 install pyvespa
#!docker info | grep "Total Memory"

## Deploy the Vespa application

Deploy `package` on the local machine using Docker,
without leaving the notebook, by creating an instance of
[VespaDocker](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaDocker). `VespaDocker` connects
to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/).

If this step fails, check that the Docker daemon is running
and that clients can use the Docker daemon socket (configurable under advanced settings in Docker Desktop).


## Create an application package

The [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage)
contains all the Vespa configuration files.
Create one from scratch:


In [None]:
%load_ext autoreload
%autoreload 2

In [2]:
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    FieldSet,
)

package = ApplicationPackage(
    name="pyvespafeed",
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["summary"]),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["text"])],
        )
    ],
)

Note that the application name cannot contain `-` or `_`.
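The naming rule can be checked before a package is built. The sketch below is an illustration only: `find_forbidden_chars` and `FORBIDDEN_NAME_CHARS` are hypothetical names, not part of pyvespa, and the check covers only the two characters mentioned in the note, not any other rule Vespa may enforce.

```python
from typing import List

# Characters the note says must not appear in the application name.
FORBIDDEN_NAME_CHARS = "-_"


def find_forbidden_chars(name: str) -> List[str]:
    """Return the forbidden characters found in `name`, in order of appearance."""
    return [ch for ch in name if ch in FORBIDDEN_NAME_CHARS]


for candidate in ["pyvespafeed", "pyvespa-feed", "pyvespa_feed"]:
    bad = find_forbidden_chars(candidate)
    status = "ok" if not bad else f"contains {bad}"
    print(f"{candidate}: {status}")
```

The document namespace passed to the feed calls later in this notebook (`pyvespa-feed`) is a separate setting from the application name.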


In [3]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=package)



Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.


`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.


## Preparing the data

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to load the
["Cohere/wikipedia-2023-11-embed-multilingual-v3"](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3) dataset and index it in our newly deployed Vespa instance.

The dataset contains Wikipedia pages and their corresponding embeddings.

> For this exploration, we feed only the `text` field and leave the embeddings out.

The following loads the dataset with `streaming=False`, which downloads the contents locally.
The [stream](https://huggingface.co/docs/datasets/stream) option of datasets can be used instead to avoid downloading everything.
The `map` functionality allows us to convert the dataset fields into the feed format that `pyvespa` expects, a dict with the keys `id` and `fields`:

`{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`
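As a minimal sketch of that conversion on a single record, the function below builds the feed dict from a plain Python dict. The name `to_feed_doc` and the sample record are hypothetical and only illustrate the shape; the notebook itself does the same thing with a lambda passed to `map`.

```python
from typing import Any, Dict


def to_feed_doc(record: Dict[str, Any], suffix: str = "-iter") -> Dict[str, Any]:
    """Convert one dataset record into the {"id": ..., "fields": {...}} feed format."""
    return {
        "id": record["_id"] + suffix,        # document id, made unique per feed method
        "fields": {"text": record["text"]},  # only the text field is fed
    }


# Hypothetical sample record with the two keys used in this notebook.
sample = {"_id": "example_0", "text": "Some passage text."}
print(to_feed_doc(sample))
```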


In [39]:
from datasets import load_dataset

# The "simple" config matches the document ids (20231101.simple_...) in the outputs below.
dataset = load_dataset(
    "Cohere/wikipedia-2023-11-embed-multilingual-v3",
    "simple",
    split="train",
    streaming=False,
)
vespa_feed = dataset.map(
    lambda x: {"id": x["_id"] + "-iter", "fields": {"text": x["text"]}}
)
vespa_json_feed = vespa_feed.map(
    lambda x: {
        "put": f"id:pyvespa-feed:doc::{x['_id']}-json",
        "fields": {"text": x["text"]},
    }
)

Downloading data: 100%|██████████| 212M/212M [00:12<00:00, 17.3MB/s] 
Downloading data: 100%|██████████| 213M/213M [00:14<00:00, 14.9MB/s] 
Downloading data: 100%|██████████| 214M/214M [00:13<00:00, 16.1MB/s] 
Downloading data: 100%|██████████| 212M/212M [00:12<00:00, 17.2MB/s] 
Downloading data: 100%|██████████| 211M/211M [00:12<00:00, 16.8MB/s] 
Downloading data: 100%|██████████| 208M/208M [00:12<00:00, 16.5MB/s] 
Downloading data: 100%|██████████| 98.4M/98.4M [00:05<00:00, 17.2MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/646424 [00:00<?, ? examples/s]

Map:   0%|          | 0/646424 [00:00<?, ? examples/s]

In [40]:
vespa_feed_iter = vespa_feed.select_columns(["id", "fields"])

## Feeding sync


In [7]:
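from vespa.io import VespaResponse


# Minimal callback for the feed calls below (assumed; it only reports failures).
def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Failed to feed document {id}: {response.get_json()}")


def my_vespa_iter():
    # Yield the prepared Wikipedia feed documents one by one.
    for d in vespa_feed_iter:
        yield d


# Synthetic feed: 100,000 documents that all share the same large text.
ids = [f"doc_{i}" for i in range(100_000)]
large_text = " ".join(["document text"] * 1000)


def my_large_iter():
    for i in ids:
        yield {"id": i, "fields": {"text": large_text}}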

In [8]:
# Approximate size of large_text in bytes (sys.getsizeof includes Python object overhead)
import sys

sys.getsizeof(large_text)

14048
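`sys.getsizeof` reports the size of the Python string object, which is slightly larger than the payload that is actually sent. The sketch below rebuilds the same `large_text` and compares the object size with the UTF-8 encoded length. The variable names are illustrative, and the exact object overhead depends on the Python implementation.

```python
import sys

# Same synthetic text as in the cell above.
large_text = " ".join(["document text"] * 1000)

char_count = len(large_text)                   # number of characters
utf8_bytes = len(large_text.encode("utf-8"))   # bytes when encoded as UTF-8
object_size = sys.getsizeof(large_text)        # in-memory size, including overhead

print(f"characters: {char_count}")
print(f"UTF-8 bytes: {utf8_bytes}")
print(f"object overhead: {object_size - utf8_bytes} bytes")
```

Because the text is pure ASCII, the character count and the UTF-8 byte count are the same.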

In [41]:
app.feed_iterable(
    vespa_feed_iter, schema="doc", namespace="pyvespa-feed", callback=callback
)

Preparing feed requests...


0 Requests [00:00, ? Requests/s]

142576 Requests [02:56, 808.43 Requests/s]

KeyboardInterrupt: 

In [42]:
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)

In [43]:
responses = await app.feed_iterable_async(
    vespa_feed_iter, schema="doc", namespace="pyvespa-feed", callback=callback
)

Preparing feed requests...


100%|██████████| 646424/646424 [05:13<00:00, 2058.91 Requests/s]


Requests completed


In [12]:
app.feed_iterable(
    my_large_iter(), schema="doc", namespace="pyvespa-feed", callback=callback
)

Preparing feed requests...


0 Requests [00:00, ? Requests/s]

100000 Requests [02:07, 781.36 Requests/s]

Requests completed





In [38]:
large_async_responses = await app.feed_iterable_async(
    my_large_iter(), schema="doc", namespace="pyvespa-feed", callback=callback
)

Preparing feed requests...


100%|██████████| 100000/100000 [00:42<00:00, 2358.82 Requests/s]


Requests completed
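To put the progress-bar numbers side by side, the sketch below computes how much faster the async runs were than the sync runs. The rates are copied from the outputs in this notebook; the names `rates` and `speedup` are illustrative. Note that the sync run on the Wikipedia data was interrupted after 142,576 requests, so its rate covers only part of the dataset, and all figures come from a single run on one machine.

```python
# Requests per second, as reported by the progress bars in this notebook.
rates = {
    "wikipedia": {"sync": 808.43, "async": 2058.91},
    "large_text": {"sync": 781.36, "async": 2358.82},
}


def speedup(sync_rate: float, async_rate: float) -> float:
    """Ratio of async throughput to sync throughput."""
    return async_rate / sync_rate


speedups = {
    name: speedup(r["sync"], r["async"]) for name, r in rates.items()
}
for name, factor in speedups.items():
    print(f"{name}: async is {factor:.2f}x the sync rate")
```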


## Feeding with Vespa CLI


In [45]:
vespa_json_feed.select_columns(["put", "fields"]).to_json(
    "vespa_feed.json", orient="records", lines=True
)

Creating json from Arrow format:   0%|          | 0/647 [00:00<?, ?ba/s]

230548959
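The feed file is in JSON Lines format: one `put` operation per line, with a full document id and the fields. As a rough sketch of the same layout without the datasets library, the code below writes such a file with the standard `json` module. The function name `write_jsonl_feed`, the sample records and the temporary file path are all hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable


def write_jsonl_feed(
    records: Iterable[Dict[str, str]],
    path: str,
    namespace: str = "pyvespa-feed",
    doc_type: str = "doc",
) -> int:
    """Write one put operation per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            operation = {
                "put": f"id:{namespace}:{doc_type}::{record['_id']}-json",
                "fields": {"text": record["text"]},
            }
            f.write(json.dumps(operation, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical sample records, written to a temporary directory.
sample_records = [
    {"_id": "example_0", "text": "first passage"},
    {"_id": "example_1", "text": "second passage"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    feed_path = os.path.join(tmp_dir, "sample_feed.json")
    written = write_jsonl_feed(sample_records, feed_path)
    with open(feed_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(written)
print(lines[0])
```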

In [46]:
!head -3 vespa_feed.json

{"put":"id:pyvespa-feed:doc::20231101.simple_1_0-json","fields":{"text":"April (Apr.) is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of the four months to have 30 days."}}
{"put":"id:pyvespa-feed:doc::20231101.simple_1_1-json","fields":{"text":"April always begins on the same day of the week as July, and additionally, January in leap years. April always ends on the same day of the week as December."}}
{"put":"id:pyvespa-feed:doc::20231101.simple_1_2-json","fields":{"text":"April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year."}}
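Source text can contain code points that Vespa does not accept in string fields; the `vespa feed` run in this section rejects one document because of code point 0x10FFFE. One possible way to avoid this is to strip such characters before writing the feed. The sketch below removes Unicode noncharacters (U+FDD0 to U+FDEF, and any code point ending in FFFE or FFFF). This is an assumption about what needs to be removed: the log only confirms 0x10FFFE, and Vespa may reject other code points that this helper does not handle. `strip_noncharacters` is a hypothetical name.

```python
def is_noncharacter(code_point: int) -> bool:
    """True for Unicode noncharacters such as U+FFFE or U+10FFFE."""
    in_fdd0_block = 0xFDD0 <= code_point <= 0xFDEF
    ends_in_fffe_or_ffff = (code_point & 0xFFFF) in (0xFFFE, 0xFFFF)
    return in_fdd0_block or ends_in_fffe_or_ffff


def strip_noncharacters(text: str) -> str:
    """Return `text` with all Unicode noncharacters removed."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


# The code point reported in the feed error, embedded in a short string.
dirty = "before" + chr(0x10FFFE) + "after"
clean = strip_noncharacters(dirty)
print(len(dirty), len(clean))
```

Applied inside the `map` step, a helper like this would clean the `text` value before it is fed, at the cost of silently changing the source text.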


In [47]:
!vespa feed vespa_feed.json

feed: got status 400 ({"pathId":"/document/v1/pyvespa-feed/doc/docid/20231101.simple_904391_5-json","message":"Error in document 'id:pyvespa-feed:doc::20231101.simple_904391_5-json' - could not parse field 'text' of type 'string': The string field value contains illegal code point 0x10FFFE: The string field value contains illegal code point 0x10FFFE"}) for put id:pyvespa-feed:doc::20231101.simple_904391_5-json: not retryable
{
  "feeder.operation.count": 646424,
  "feeder.seconds": 35.898,
  "feeder.ok.count": 646423,
  "feeder.ok.rate": 18007.115,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 646424,
  "http.request.bytes": 171351348,
  "http.request.MBps": 4.773,
  "http.exception.count": 0,
  "http.response.count": 646424,
  "http.response.bytes": 87901597,
  "http.response.MBps": 2.449,
  "http.response.error.count": 1,
  "http.response.latency.millis.min": 1,
  "http.response.latency.millis.avg": 29,
  "http.response.latency.millis.max": 285,
  "

## Cleanup


In [None]:
vespa_docker.container.stop()
vespa_docker.container.remove()

NotFound: 404 Client Error for http+docker://localhost/v1.41/containers/0fd9fa9ba6a9dee5b50f321a33d41abde7f95bc0be4244cb2f7f5a61abbb3dc3/stop: Not Found ("no container with name or ID "0fd9fa9ba6a9dee5b50f321a33d41abde7f95bc0be4244cb2f7f5a61abbb3dc3" found: no such container")

## Next steps

This is just an introduction to the capabilities of Vespa and pyvespa.
Browse the site to learn more about schemas, feeding and queries,
and find more complex applications in
[examples](https://pyvespa.readthedocs.io/en/latest/examples.html).
