too many (#669)
jobergum committed Feb 2, 2024
1 parent 2e8be33 commit 7961c04
Showing 1 changed file with 68 additions and 105 deletions.
173 changes: 68 additions & 105 deletions docs/sphinx/source/examples/nomic-embeddings-cloud.ipynb
@@ -12,15 +12,15 @@
"</picture>\n",
"\n",
"\n",
"# Arxiv AI-powered search\n",
"# Arxiv AI-powered search with Nomic (nomic-embed-text-v1) and Vespa\n",
"\n",
"This notebook demonstrates how to load a ArxiV dataset hosted on [HF datasets](https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv) \n",
"and feed it to a Vespa instance. The dataset comprises of English language arXiv papers from the Cornell/arXiv dataset, with two new columns added: title-embeddings and abstract-embeddings. Embeddings generated using the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) embeddings model. \n",
"This notebook demonstrates how to use the recently announced open-source Nomic embedding model\n",
"([nomic blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1)) with Vespa. It also demonstrates how to \n",
"load an arXiv dataset hosted on [HF datasets](https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv) \n",
"and feed it to a Vespa instance.\n",
"\n",
"In this notebook, we use Vespa's embedder functionality to include the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) embedding\n",
"model into Vespa for query serving. \n",
"\n",
"This is work in progress - we want to demonstrate more query examples. "
"In this notebook, we use Vespa's embedder functionality to include the [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) embedding model into Vespa for query serving. We use a quantized model to improve CPU inference performance. \n"
]
},
{
@@ -43,13 +43,16 @@
"[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). \n",
"A Vespa application package consists of configuration files, schemas, models, and code (plugins). \n",
"\n",
"First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. This is a translation\n",
"of the dataset features:"
"First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. \n",
"\n",
"The nomic technical report [(pdf)](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) mentions\n",
"that the model was trained with prefix instructions, using `search_document` as a prefix for documents and `search_query` \n",
"as a prefix for queries. We add these prefixes ourselves: in the schema's indexing expression for documents and in the `embed` query input for queries. "
]
},
{
"cell_type": "code",
"execution_count": 93,
"execution_count": 2,
"id": "0dca2378",
"metadata": {},
"outputs": [],
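The prefix convention described in this hunk can be sketched in plain Python. This is an illustration only, not code from the commit: the helper names `document_text` and `query_text` are hypothetical, and the space-separated prefix (no colon) simply mirrors the strings used in this notebook's indexing expression and query input.

```python
# Prefixes as they appear in this notebook (assumption: joined with a single space).
DOCUMENT_PREFIX = "search_document"
QUERY_PREFIX = "search_query"


def document_text(title: str, abstract: str) -> str:
    """Mirror the schema's indexing expression: prefix, title and abstract joined by spaces."""
    return f"{DOCUMENT_PREFIX} {title} {abstract}"


def query_text(user_query: str) -> str:
    """Build the text that is handed to the embedder at query time."""
    return f"{QUERY_PREFIX} {user_query}"


print(query_text("dark matter field fluid model"))
# prints "search_query dark matter field fluid model"
```

Keeping both prefixes in one place makes it harder for the document side and the query side to drift apart.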
@@ -68,14 +71,10 @@
" Field(name=\"journal_ref\", type=\"string\", indexing=[\"summary\", \"index\"]),\n",
" Field(name=\"doi\", type=\"string\", indexing=[\"summary\", \"index\"]),\n",
" Field(name=\"categories\", type=\"array<string>\", indexing=[\"summary\", \"index\"], match=[\"word\"]),\n",
" Field(name=\"title_embedding\", type=\"tensor<bfloat16>(x[384])\",\n",
" indexing=[\"attribute\", \"index\"],\n",
" ann=HNSW(distance_metric=\"angular\")\n",
" ),\n",
" Field(name=\"abstract_embedding\", type=\"tensor<bfloat16>(x[384])\",\n",
" indexing=[\"attribute\", \"index\"],\n",
" ann=HNSW(distance_metric=\"angular\")\n",
" ),\n",
" Field(name=\"embedding\", type=\"tensor<bfloat16>(x[768])\",\n",
" indexing=[\"\\\"search_document\\\" . \\\" \\\" . input title . \\\" \\\". input abstract \", \"embed\", \"index\", \"attribute\"],\n",
" ann=HNSW(distance_metric=\"angular\"),\n",
" is_document_field=False) \n",
" ],\n",
" ),\n",
" fieldsets=[\n",
@@ -84,9 +83,18 @@
")"
]
},
{
"cell_type": "markdown",
"id": "3acc9020",
"metadata": {},
"source": [
"## Configure embedder\n",
"This uses Vespa's embedder inference support with the Xenova (Transformers.js) ONNX checkpoints of the model. "
]
},
{
"cell_type": "code",
"execution_count": 94,
"execution_count": 3,
"id": "66c5da1d",
"metadata": {},
"outputs": [],
@@ -97,11 +105,10 @@
"vespa_application_package = ApplicationPackage(\n",
" name=vespa_app_name,\n",
" schema=[paper_schema],\n",
" components=[Component(id=\"bge\", type=\"hugging-face-embedder\",\n",
" components=[Component(id=\"nomic\", type=\"hugging-face-embedder\",\n",
" parameters=[\n",
" Parameter(\"transformer-model\", {\"url\": \"https://huggingface.co/Xenova/bge-small-en-v1.5/resolve/main/onnx/model.onnx\"}),\n",
" Parameter(\"tokenizer-model\", {\"url\": \"https://huggingface.co/Xenova/bge-small-en-v1.5/raw/main/tokenizer.json\"}),\n",
" Parameter(\"pooling-strategy\", args=dict(), children=\"cls\")\n",
" Parameter(\"transformer-model\", {\"url\": \"https://huggingface.co/Xenova/nomic-embed-text-v1/resolve/main/onnx/model_quantized.onnx\"}),\n",
" Parameter(\"tokenizer-model\", {\"url\": \"https://huggingface.co/Xenova/nomic-embed-text-v1/raw/main/tokenizer.json\"})\n",
" ]\n",
" )]\n",
") "
@@ -127,7 +134,7 @@
},
{
"cell_type": "code",
"execution_count": 101,
"execution_count": 4,
"id": "a8ce5624",
"metadata": {},
"outputs": [],
@@ -136,7 +143,7 @@
"\n",
"bm25 = RankProfile(\n",
" name=\"bm25\", \n",
" inputs=[(\"query(q)\", \"tensor<float>(x[384])\")],\n",
" inputs=[(\"query(q)\", \"tensor<float>(x[768])\")],\n",
" \n",
" first_phase=FirstPhaseRanking(\n",
" expression=\"bm25(title) + bm25(abstract)\",\n",
@@ -145,14 +152,14 @@
"\n",
"hybrid = RankProfile(\n",
" name=\"hybrid\", \n",
" inputs=[(\"query(q)\", \"tensor<float>(x[384])\")],\n",
" inputs=[(\"query(q)\", \"tensor<float>(x[768])\")],\n",
" first_phase=FirstPhaseRanking(\n",
" expression=\"closeness(field, title_embedding) + closeness(field, abstract_embedding)\"\n",
" expression=\"closeness(field, embedding)\"\n",
" ),\n",
" global_phase=GlobalPhaseRanking(\n",
" expression=\"reciprocal_rank_fusion(closeness(field,title_embedding), bm25(title), bm25(abstract), closeness(field,abstract_embedding))\"\n",
" expression=\"reciprocal_rank_fusion(closeness(field,embedding), bm25(title), bm25(abstract))\"\n",
" ),\n",
" match_features=[\"bm25(title)\", \"bm25(abstract)\", \"closeness(field, title_embedding)\", \"closeness(field, abstract_embedding)\"]\n",
" match_features=[\"bm25(title)\", \"bm25(abstract)\", \"closeness(field, embedding)\"]\n",
")\n",
"paper_schema.add_rank_profile(bm25)\n",
"paper_schema.add_rank_profile(hybrid)"
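The `reciprocal_rank_fusion` expression in the `hybrid` profile combines the per-feature rankings rather than the raw scores. A minimal sketch of the idea, assuming the common formulation score = sum over features of 1 / (k + rank) with k = 60 (believed to be Vespa's default, but not stated in this commit). The function below is a hypothetical illustration, not Vespa's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one score per document.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Three hypothetical rankings, one per feature in the hybrid profile.
rankings = [
    ["a", "b", "c"],  # ordered by closeness(field, embedding)
    ["a", "c", "b"],  # ordered by bm25(title)
    ["a", "b", "c"],  # ordered by bm25(abstract)
]
fused = reciprocal_rank_fusion(rankings)
best = max(fused, key=fused.get)
```

Under this assumption a document ranked first by all three features scores 3/61, about 0.0492, which is consistent with the top `relevance` of 0.04918032786885246 in the hybrid query output further down in this diff.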
@@ -244,7 +251,7 @@
"source": [
"import os\n",
"\n",
"os.environ[\"TENANT_NAME\"] = \"vespa-team\" # Replace with your tenant name\n",
"os.environ[\"TENANT_NAME\"] = \"samples\" # Replace with your tenant name\n",
"\n",
"vespa_cli_command = f'vespa config set application {os.environ[\"TENANT_NAME\"]}.{vespa_app_name}'\n",
"\n",
@@ -263,7 +270,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "1f0b97c8",
"metadata": {},
"outputs": [],
@@ -328,7 +335,7 @@
},
{
"cell_type": "code",
"execution_count": 103,
"execution_count": 10,
"id": "b5fddf9f",
"metadata": {},
"outputs": [],
@@ -364,50 +371,10 @@
},
{
"cell_type": "code",
"execution_count": 104,
"execution_count": null,
"id": "fe954dc4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Deployment started in run 7 of dev-aws-us-east-1c for samples.arxivsearch. This may take a few minutes the first time.\n",
"INFO [12:01:11] Deploying platform version 8.284.4 and application dev build 7 for dev-aws-us-east-1c of default ...\n",
"INFO [12:01:11] Using CA signed certificate version 0\n",
"INFO [12:01:12] Using 1 nodes in container cluster 'arxivsearch_container'\n",
"INFO [12:01:13] Using 1 nodes in container cluster 'arxivsearch_container'\n",
"INFO [12:01:15] Deployment successful.\n",
"INFO [12:01:15] Session 247 for tenant 'samples' prepared and activated.\n",
"INFO [12:01:15] ######## Details for all nodes ########\n",
"INFO [12:01:15] h90001f.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
"INFO [12:01:15] --- platform vespa/cloud-tenant-rhel8:8.284.4\n",
"INFO [12:01:15] --- logserver-container on port 4080 has config generation 247, wanted is 247\n",
"INFO [12:01:15] --- metricsproxy-container on port 19092 has config generation 247, wanted is 247\n",
"INFO [12:01:15] h90001g.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
"INFO [12:01:15] --- platform vespa/cloud-tenant-rhel8:8.284.4\n",
"INFO [12:01:15] --- container-clustercontroller on port 19050 has config generation 247, wanted is 247\n",
"INFO [12:01:15] --- metricsproxy-container on port 19092 has config generation 247, wanted is 247\n",
"INFO [12:01:15] h90024a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
"INFO [12:01:15] --- platform vespa/cloud-tenant-rhel8:8.284.4\n",
"INFO [12:01:15] --- container on port 4080 has config generation 247, wanted is 247\n",
"INFO [12:01:15] --- metricsproxy-container on port 19092 has config generation 247, wanted is 247\n",
"INFO [12:01:15] h90026a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
"INFO [12:01:15] --- platform vespa/cloud-tenant-rhel8:8.284.4\n",
"INFO [12:01:15] --- storagenode on port 19102 has config generation 246, wanted is 247\n",
"INFO [12:01:15] --- searchnode on port 19107 has config generation 247, wanted is 247\n",
"INFO [12:01:15] --- distributor on port 19111 has config generation 246, wanted is 247\n",
"INFO [12:01:15] --- metricsproxy-container on port 19092 has config generation 247, wanted is 247\n",
"INFO [12:01:21] Found endpoints:\n",
"INFO [12:01:21] - dev.aws-us-east-1c\n",
"INFO [12:01:21] |-- https://fa63b7b7.e9029380.z.vespa-app.cloud/ (cluster 'arxivsearch_container')\n",
"INFO [12:01:21] Installation succeeded!\n",
"Using mTLS (key,cert) Authentication against endpoint https://fa63b7b7.e9029380.z.vespa-app.cloud//ApplicationStatus\n",
"Application is up!\n",
"Finished deployment.\n"
]
}
],
"outputs": [],
"source": [
"from vespa.application import Vespa\n",
"app:Vespa = vespa_cloud.deploy()"
@@ -426,7 +393,7 @@
},
{
"cell_type": "code",
"execution_count": 105,
"execution_count": null,
"id": "8f422178",
"metadata": {},
"outputs": [],
@@ -442,8 +409,6 @@
" \"id\": x[\"id\"],\n",
" \"title\": x[\"title\"],\n",
" \"abstract\": x[\"abstract\"],\n",
" \"title_embedding\": x[\"title_embedding\"],\n",
" \"abstract_embedding\": x[\"abstract_embedding\"],\n",
" \"journal_ref\": x.get(\"journal-ref\",None),\n",
" \"doi\": x.get(\"doi\",None),\n",
" \"categories\": x[\"categories\"],\n",
@@ -483,7 +448,7 @@
},
{
"cell_type": "code",
"execution_count": 109,
"execution_count": 13,
"id": "b9349fb4",
"metadata": {},
"outputs": [
@@ -493,33 +458,31 @@
"text": [
"[\n",
" {\n",
" \"id\": \"index:arxivsearch_content/0/cfdff72f28cffdb0b73f6026\",\n",
" \"relevance\": 0.06384129063829451,\n",
" \"id\": \"index:arxivsearch_content/0/93be86dc99f3b26b012796f6\",\n",
" \"relevance\": 0.04918032786885246,\n",
" \"source\": \"arxivsearch_content\",\n",
" \"fields\": {\n",
" \"matchfeatures\": {\n",
" \"bm25(abstract)\": 0.0,\n",
" \"bm25(title)\": 0.0,\n",
" \"closeness(field,abstract_embedding)\": 0.6178772298066597,\n",
" \"closeness(field,title_embedding)\": 0.6288338602029975\n",
" \"bm25(abstract)\": 16.812307256543374,\n",
" \"bm25(title)\": 10.293464220282218,\n",
" \"closeness(field,embedding)\": 0.5253550471303831\n",
" },\n",
" \"id\": \"0812.3122\",\n",
" \"title\": \"Cosmological constraints on unifying Dark Fluid models\"\n",
" \"id\": \"704.0003\",\n",
" \"title\": \"The evolution of the Earth-Moon system based on the dark matter field\\n fluid model\"\n",
" }\n",
" },\n",
" {\n",
" \"id\": \"index:arxivsearch_content/0/c77e9d766bd90c894a5d0481\",\n",
" \"relevance\": 0.06198484047241319,\n",
" \"id\": \"index:arxivsearch_content/0/9fba7416eaa319e25a2b7b6f\",\n",
" \"relevance\": 0.047619047619047616,\n",
" \"source\": \"arxivsearch_content\",\n",
" \"fields\": {\n",
" \"matchfeatures\": {\n",
" \"bm25(abstract)\": 0.0,\n",
" \"bm25(title)\": 0.0,\n",
" \"closeness(field,abstract_embedding)\": 0.5754037589718138,\n",
" \"closeness(field,title_embedding)\": 0.6644048114912198\n",
" \"bm25(abstract)\": 6.052146920083321,\n",
" \"bm25(title)\": 3.3033206224361487,\n",
" \"closeness(field,embedding)\": 0.4964298779859894\n",
" },\n",
" \"id\": \"0711.0466\",\n",
" \"title\": \"A Model for Dark Matter Halos\"\n",
" \"id\": \"704.0077\",\n",
" \"title\": \"Universal Forces and the Dark Energy Problem\"\n",
" }\n",
" }\n",
"]\n"
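The `closeness(field,embedding)` match feature in the output above is derived from the distance between the query vector and the document vector. A sketch, assuming that with `distance_metric="angular"` the distance is the angle in radians between the two vectors and that closeness is 1 / (1 + distance). Both are assumptions about Vespa's behaviour, not something this commit states, and `angular_closeness` is a hypothetical helper name.

```python
import math


def angular_closeness(a, b):
    """Closeness of two vectors under an (assumed) angular distance metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / norm))
    distance = math.acos(cos_sim)  # angle in radians, between 0 and pi
    return 1.0 / (1.0 + distance)


same = angular_closeness([1.0, 0.0], [1.0, 0.0])        # angle 0
orthogonal = angular_closeness([1.0, 0.0], [0.0, 1.0])  # angle pi / 2
```

Read this way, a closeness of about 0.525 corresponds to an angle of roughly 0.9 radians between the query and document embeddings.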
@@ -531,12 +494,12 @@
"import json\n",
"\n",
"response:VespaQueryResponse = app.query(\n",
" yql=\"select title, id from paper where ({targetHits:10}nearestNeighbor(title_embedding,q)) or ({targetHits:10}nearestNeighbor(abstract_embedding,q))\",\n",
" yql=\"select title, id from paper where ({targetHits:10}nearestNeighbor(embedding,q)) or userQuery()\",\n",
" ranking=\"hybrid\",\n",
" query=\"dark matter field fluid model\",\n",
" body={\n",
" \"presentation.format.tensors\": \"short-value\",\n",
" \"input.query(q)\": \"embed(bge, \\\"dark matter field fluid model\\\")\",\n",
" \"input.query(q)\": \"embed(nomic, \\\"search_query dark matter field fluid model\\\")\",\n",
" }\n",
")\n",
"assert(response.is_successful())\n",
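The query in this hunk repeats the user text twice: once as the lexical `query` and once, with the `search_query` prefix, inside the `embed(...)` input. A small sketch of a builder that keeps the two in sync. `build_hybrid_query` is a hypothetical helper, not part of pyvespa, and it assumes the embedder id `nomic` and the `hybrid` rank profile defined in this notebook.

```python
def build_hybrid_query(user_query: str, embedder_id: str = "nomic", target_hits: int = 10) -> dict:
    """Return a Vespa query body for hybrid (nearest-neighbor OR lexical) retrieval."""
    if '"' in user_query:
        # Quoting inside embed(...) is not handled in this sketch.
        raise ValueError("double quotes in the query text are not supported here")
    yql = (
        "select title, id from paper where "
        f"({{targetHits:{target_hits}}}nearestNeighbor(embedding,q)) or userQuery()"
    )
    return {
        "yql": yql,
        "ranking": "hybrid",
        "query": user_query,  # consumed by userQuery() for BM25 matching
        "presentation.format.tensors": "short-value",
        # The model expects the search_query prefix on the query side.
        "input.query(q)": f'embed({embedder_id}, "search_query {user_query}")',
    }


body = build_hybrid_query("dark matter field fluid model")
```

The resulting dict should be usable as the request body, for example `app.query(body=body)`, although that call is not part of this commit.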
@@ -545,7 +508,7 @@
},
{
"cell_type": "code",
"execution_count": 108,
"execution_count": 14,
"id": "405cdb72",
"metadata": {},
"outputs": [
@@ -555,21 +518,21 @@
"text": [
"[\n",
" {\n",
" \"id\": \"index:arxivsearch_content/0/cfdff72f28cffdb0b73f6026\",\n",
" \"relevance\": 31.398304828681407,\n",
" \"id\": \"index:arxivsearch_content/0/93be86dc99f3b26b012796f6\",\n",
" \"relevance\": 27.10577147682559,\n",
" \"source\": \"arxivsearch_content\",\n",
" \"fields\": {\n",
" \"id\": \"0812.3122\",\n",
" \"title\": \"Cosmological constraints on unifying Dark Fluid models\"\n",
" \"id\": \"704.0003\",\n",
" \"title\": \"The evolution of the Earth-Moon system based on the dark matter field\\n fluid model\"\n",
" }\n",
" },\n",
" {\n",
" \"id\": \"index:arxivsearch_content/0/6033639d686a018894cdd4ec\",\n",
" \"relevance\": 30.574650705468287,\n",
" \"id\": \"index:arxivsearch_content/0/9fba7416eaa319e25a2b7b6f\",\n",
" \"relevance\": 9.35546754251947,\n",
" \"source\": \"arxivsearch_content\",\n",
" \"fields\": {\n",
" \"id\": \"0812.3611\",\n",
" \"title\": \"Dark Energy vs. Dark Matter: Towards a Unifying Scalar Field?\"\n",
" \"id\": \"704.0077\",\n",
" \"title\": \"Universal Forces and the Dark Energy Problem\"\n",
" }\n",
" }\n",
"]\n"
@@ -595,7 +558,7 @@
"source": [
"## Summary\n",
"\n",
"This notebook demonstrates how to interact with HF datasets, including embedding models in Vespa and querying. "
"This notebook demonstrates how to load a HF dataset, run an embedding model inside Vespa, and query the result. Now we can delete the Vespa Cloud instance!"
]
},
{