# Using Weaviate embedded for question/answering on your vector store

In this tutorial we use Weaviate Embedded to create a vector store and question/answering from transcribed podcasts. The steps will include uploading your data from a local store, and creating a schema as well as an object store.

In [7]:
import weaviate
from weaviate.embedded import EmbeddedOptions
from weaviate.util import generate_uuid5
import json
import helper

An embedded Weaviate instance allows us to have the source data saved and retrieved locally, while having access to the vectorizing modules available. 

In [2]:
client = weaviate.Client(
    embedded_options=EmbeddedOptions(
        additional_env_vars={
        "ENABLE_MODULES":
        "text2vec-openai,text2vec-cohere,text2vec-huggingface"}
    )
)

Started /Users/tdubon/.cache/weaviate-embedded: process ID 3033


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-10-04T11:54:33-07:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-10-04T11:54:33-07:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"podcast_ORJj3GQyYa6n","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-10-04T11:54:33-07:00","took":124744}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-10-04T11:54:33-07:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-10-04T11:54:33-07:00"}


In [3]:
#Print the client information to confirm the modules are loaded.
meta_info = client.get_meta()
print(json.dumps(meta_info, indent=2))

{
  "hostname": "http://127.0.0.1:6666",
  "modules": {
    "text2vec-cohere": {
      "documentationHref": "https://docs.cohere.ai/embedding-wiki/",
      "name": "Cohere Module"
    },
    "text2vec-huggingface": {
      "documentationHref": "https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task",
      "name": "Hugging Face Module"
    },
    "text2vec-openai": {
      "documentationHref": "https://platform.openai.com/docs/guides/embeddings/what-are-embeddings",
      "name": "OpenAI Module"
    }
  },
  "version": "1.21.1"
}


In the cell below we setup the schema, an outline requiring the data type, vectorizer and the list of classes. Note that it is essential to have your data cleaned and the categories clearly identified for this step. If using your own vectorizer, "none" should be specified for "vectorizer". 

In [4]:
client.schema.delete_all()
schema = {
    "classes": [
        {
            "class": "Podcast",
            "vectorizer": "text2vec-cohere",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"]
                },
                {
                    "name": "transcript",
                    "dataType": ["text"]
                }
            ]
        }
    ]
}
client.schema.create(schema)

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"podcast_pEughRQxrV7H","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-10-04T11:54:48-07:00","took":92722}


In the following cells we load the locally stored data (in json format) and create a function definition for an add_podcast object. 

The name of the object represents the highest level classification for your data, indicated below as podcast_object (in dictionary type). Target class represents the next level in the classification of your data. Here we indicate it below as the string "Podcast", but note that multiple classes could have been specified, for example, if we had different categories of podcasts, such as English, Spanish, etc.

The function definition below is implementing batch_size=1. Note that with larger amounts of data you will want to adjust this setting. Per the documentation: "batch imports are used to maximize import speed and minimize network latency. Batch import processes multiple objects per request, and clients can parallelize the process."

In [5]:
with open("/Users/tdubon/DEMO-semantic-search-podcast/data/podcast_ds.json", 'r') as f:
    datastore = json.load(f)



In [18]:
def add_podcasts(batch_size = 1):
    no_items_in_batch = 0
    for item in datastore:
        podcast_object = {
            "title": item["title"],
            "transcript": item["transcript"]
        }

        podcast_uuid = generate_uuid5('podcast', item["title"] + item["transcript"])
        client.batch.add_data_object(podcast_object, "Podcast", podcast_uuid)
        no_items_in_batch += 1

        if no_items_in_batch >= batch_size:
            results = client.batch.create_objects()

            for result in results:
                    if result['result'] != {}:
                        helper.log(result['result'])

            message = str(item["title"]) + ' imported'
            helper.log(message)

            no_items_in_batch = 0

    client.batch.create_objects()


In [19]:
add_podcasts(1)

{'error': [{'message': 'update vector: Cohere API Key: no api key found neither in request header: X-Cohere-Api-Key nor in environment variable under COHERE_APIKEY'}]}
2023-10-04 19:34:51.134682 | {'errors': {'error': [{'message': 'update vector: Cohere API Key: no api key found neither in request header: X-Cohere-Api-Key nor in environment variable under COHERE_APIKEY'}]}, 'status': 'FAILED'}
2023-10-04 19:34:51.134962 | go-time-154.md imported
{'error': [{'message': 'update vector: Cohere API Key: no api key found neither in request header: X-Cohere-Api-Key nor in environment variable under COHERE_APIKEY'}]}
2023-10-04 19:34:51.152580 | go-time-155.md imported
{'error': [{'message': 'update vector: Cohere API Key: no api key found neither in request header: X-Cohere-Api-Key nor in environment variable under COHERE_APIKEY'}]}
2023-10-04 19:34:51.172640 | {'errors': {'error': [{'message': 'update vector: Cohere API Key: no api key found neither in request header: X-Cohere-Api-Key nor i

Next you would implement the pipeline for whatever steps you need to take to query your data, such as semantic search, generative search, question/answering. 

In [29]:
#Question answering - search 

response = (
    client.query
    .get("Podcast", ["transcript"])
    .with_near_text({"concepts": ["biology"]})
    .with_limit(3)
    .do()
)

print(json.dumps(res, indent=2))

{
  "errors": [
    {
      "locations": [
        {
          "column": 23,
          "line": 1
        }
      ],
      "message": "Unknown argument \"ask\" on field \"Podcast\" of type \"GetObjectsObj\".",
      "path": null
    },
    {
      "locations": [
        {
          "column": 99,
          "line": 1
        }
      ],
      "message": "Cannot query field \"question\" on type \"Podcast\".",
      "path": null
    },
    {
      "locations": [
        {
          "column": 121,
          "line": 1
        }
      ],
      "message": "Cannot query field \"answer\" on type \"PodcastAdditional\".",
      "path": null
    }
  ]
}
