<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

# Feeding performance

This exploratory notebook sheds some light on the different ways of feeding documents to Vespa.
We will look at three methods:

1. Using `feed_iterable()`
2. Using `feed_iterable_async()`
3. Using [Vespa CLI](https://docs.vespa.ai/en/vespa-cli)


<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    for any problem when running this guide.
</div>


[Install pyvespa](https://pyvespa.readthedocs.io/), start the Docker daemon, and verify that at least 6 GB of memory is available:


In [1]:
#!pip3 install pyvespa
#!docker info | grep "Total Memory"

## Create an application package

The [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage)
contains all the Vespa configuration files.
Create one from scratch:


In [2]:
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    FieldSet,
)

package = ApplicationPackage(
    name="pyvespafeed",
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["summary"]),
                ]
            ),
            fieldsets=[FieldSet(name="default", fields=["text"])],
        )
    ],
)

Note that the `ApplicationPackage` name cannot contain `-` or `_`.


## Deploy the Vespa application

Deploy `package` on the local machine using Docker,
without leaving the notebook, by creating an instance of
[VespaDocker](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaDocker). `VespaDocker` connects
to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/).

If this step fails, check that the Docker daemon is running
and that the Docker daemon socket can be used by clients (configurable under advanced settings in Docker Desktop).


In [3]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=package)



Waiting for configuration server, 0/300 seconds...


`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.


## Preparing the data

In this example we use the [HF Datasets](https://huggingface.co/docs/datasets/index) library to load the
["Cohere/wikipedia-2023-11-embed-multilingual-v3"](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3) dataset and index it in our newly deployed Vespa instance.

The dataset contains Wikipedia pages and their corresponding embeddings.

> For this exploration we will use only the `id` and `text` fields.

The datasets library offers a [stream](https://huggingface.co/docs/datasets/stream) option for reading data without
downloading all the contents locally. The cell below sets `streaming=False`, so the `ext` subset is downloaded in full;
the later `to_json` export operates on that local copy.

The `map` functionality allows us to convert the
dataset fields into the feed format that `pyvespa` expects, a dict with the keys `id` and `fields`:

`{ "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}}`
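As a minimal sketch of that mapping, independent of the `datasets` library, the generator below turns plain dicts into feed operations. The `to_feed_ops` helper and the sample records are hypothetical and exist only for illustration; the `id`/`fields` shape is the one described above.

```python
from typing import Dict, Iterable, Iterator


def to_feed_ops(
    records: Iterable[Dict[str, str]], suffix: str = "-iter"
) -> Iterator[dict]:
    """Yield one pyvespa-style feed dict per input record."""
    for record in records:
        yield {
            "id": record["_id"] + suffix,
            "fields": {"text": record["text"]},
        }


# Hypothetical sample rows shaped like the dataset's `_id` and `text` columns.
sample_records = [
    {"_id": "20231101.ext_246_0", "text": "first passage"},
    {"_id": "20231101.ext_246_1", "text": "second passage"},
]

feed_ops = list(to_feed_ops(sample_records))
print(feed_ops[0])
```

Because the helper is a generator, it produces operations lazily, which is the same property that lets an iterable be fed without building the whole list in memory first.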


In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "Cohere/wikipedia-2023-11-embed-multilingual-v3",
    "ext",
    split="train",
    streaming=False,
)
vespa_feed = dataset.map(
    lambda x: {"id": x["_id"] + "-iter", "fields": {"text": x["text"]}}
)

In [None]:
vespa_feed_iter = vespa_feed.select_columns(["id", "fields"])

In [None]:
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)

## Feeding sync


In [None]:
import time

start = time.time()
responses = app.feed_iterable(
    vespa_feed_iter,
    schema="doc",
    namespace="pyvespa-feed",
    operation_type="update",
    create=True,
)
end = time.time()
sync_feed_time = end - start
print(f"Feed time sync: {sync_feed_time}")

Preparing update requests...


10967 Requests [00:13, 790.26 Requests/s]

Requests completed
Feed time sync: 13.880165815353394





## Feeding async


In [None]:
start = time.time()
responses = await app.feed_iterable_async(
    vespa_feed_iter, schema="doc", namespace="pyvespa-feed"
)
end = time.time()
async_feed_time = end - start
print(f"Feed time async: {async_feed_time}")

Preparing feed requests...


100%|██████████| 10967/10967 [00:04<00:00, 2446.03 Requests/s]

Requests completed
Feed time async: 4.682251930236816
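To put the two timings side by side, the short calculation below derives end-to-end throughput and the relative speedup from the figures printed in the cell outputs. Treat the result as indicative only: the sync run used `operation_type="update"` with `create=True`, the async run used the default feed operation, and both were single runs on one machine.

```python
# Figures copied from the two cell outputs above.
num_docs = 10967
sync_feed_time = 13.880165815353394
async_feed_time = 4.682251930236816

# End-to-end throughput, including request preparation time.
sync_rate = num_docs / sync_feed_time
async_rate = num_docs / async_feed_time

# How many times faster the async run finished.
speedup = sync_feed_time / async_feed_time

print(f"sync:  {sync_rate:.0f} docs/s")
print(f"async: {async_rate:.0f} docs/s")
print(f"async speedup: {speedup:.2f}x")
```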





## Feeding with Vespa CLI

[Vespa CLI](https://docs.vespa.ai/en/vespa-cli) is a command-line interface for interacting with Vespa.
Among its many useful features is the `vespa feed` command, which is the recommended way of feeding large datasets into Vespa.


## Prepare the data for Vespa CLI

Vespa CLI can feed data from either many .json files or a single .jsonl file with many documents.
Each document operation must have the following JSON format:

```json
{
  "put": "id:namespace:document-type::document-id",
  "fields": {
    "field1": "value1",
    "field2": "value2"
  }
}
```

Here, `put` is the document operation. Other allowed operations are `update` and `remove`.

For reference, see https://docs.vespa.ai/en/vespa-cli#cheat-sheet
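The sketch below builds one such operation as a single JSONL line using only the standard library. The `put_operation` helper and the example document ID are hypothetical; the `id:namespace:document-type::document-id` layout is the one shown in the format above.

```python
import json


def put_operation(namespace: str, doc_type: str, doc_id: str, fields: dict) -> str:
    """Return one JSONL line holding a Vespa `put` operation."""
    operation = {
        "put": f"id:{namespace}:{doc_type}::{doc_id}",
        "fields": fields,
    }
    # json.dumps emits no newlines by default, so the result is a valid JSONL line.
    return json.dumps(operation)


line = put_operation("pyvespa-feed", "doc", "example-1-json", {"text": "hello"})
print(line)
```

Writing one such line per document to a file gives the single-file layout that `vespa feed` accepts.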


In [None]:
vespa_json_feed = vespa_feed.map(
    lambda x: {
        "put": f"id:pyvespa-feed:doc::{x['_id']}-json",
        "fields": {"text": x["text"]},
    }
)

In [None]:
vespa_json_feed.select_columns(["put", "fields"]).to_json(
    "vespa_feed.json", orient="records", lines=True
)

Creating json from Arrow format:   0%|          | 0/11 [00:00<?, ?ba/s]

4505289

In [None]:
!head -3 vespa_feed.json

{"put":"id:pyvespa-feed:doc::20231101.ext_246_0-json","fields":{"text":"G\u00fciquipedia Ay\u00faa Zona prevas \u00cdndizi A\u2013Z El conceju La troji Embassy Help for non-Extremaduran speakers Ayuda para quienes no hablan estreme\u00f1u'''"}}
{"put":"id:pyvespa-feed:doc::20231101.ext_246_1-json","fields":{"text":"|class=\"MainPageBG\" style=\"width: 50%; border: 1px solid #006600; background-color: #F5FBEF; vertical-align: top; -moz-border-radius:10px; font-size:90% \" |"}}
{"put":"id:pyvespa-feed:doc::20231101.ext_247_0-json","fields":{"text":".com (el ingr\u00e9s commercial, comercial) es un domi\u00f1u d'internet hen\u00e9ricu que horma parti el sistema e domi\u00f1us d'internet. El domi\u00f1u .com es unu los domi\u00f1us orihinalis d'internet, hue estableciu en Eneru e 1985 i oga\u00f1u es llevau pola compa\u00f1ia VeriSign."}}


In [None]:
results = !vespa feed vespa_feed.json

In [None]:
results

['Error: deployment not converged: Get "http://127.0.0.1:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge": dial tcp 127.0.0.1:19071: connect: connection refused']

## Cleanup


In [None]:
vespa_docker.container.stop()
vespa_docker.container.remove()

## Next steps

This is just an introduction to the capabilities of Vespa and pyvespa.
Browse the site to learn more about schemas, feeding and queries,
and find more complex applications in
[examples](https://pyvespa.readthedocs.io/en/latest/examples.html).
