# Using Weaviate embedded for question/answering on your vector store

In this tutorial we use Weaviate Embedded to build a vector store from transcribed podcasts and run question answering over it. The steps include loading your data from a local file, creating a schema, and importing the data as objects.

In [12]:
import weaviate
from weaviate.embedded import EmbeddedOptions
import json
import helper

An embedded Weaviate instance lets us store and retrieve the source data locally while still having access to the available vectorizer modules.

In [None]:
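client = weaviate.Client(
    embedded_options=EmbeddedOptions(
        additional_env_vars={
            # qna-openai is included because the question-answering query
            # at the end of this tutorial needs a Q&A module.
            "ENABLE_MODULES":
            "text2vec-openai,text2vec-cohere,text2vec-huggingface,qna-openai"
        }
    )
)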
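
The vectorizer modules enabled above call external inference APIs, so the client typically also needs the matching API keys. A minimal sketch, assuming the keys live in environment variables (the variable names are this example's choice, not something Weaviate mandates) and are passed through the client's `additional_headers` argument:

```python
import os

# Header names the Weaviate modules read API keys from, mapped to
# environment variable names chosen for this example (assumption).
HEADER_ENV_VARS = {
    "X-OpenAI-Api-Key": "OPENAI_APIKEY",
    "X-Cohere-Api-Key": "COHERE_APIKEY",
    "X-HuggingFace-Api-Key": "HUGGINGFACE_APIKEY",
}

def build_api_headers(env=None):
    """Return a headers dict containing only the keys that are actually set."""
    env = os.environ if env is None else env
    return {
        header: env[var]
        for header, var in HEADER_ENV_VARS.items()
        if env.get(var)
    }

headers = build_api_headers()
# Pass the result when creating the client, for example:
# weaviate.Client(embedded_options=..., additional_headers=headers)
```

Only the key for the vectorizer named in your schema is strictly needed; the helper simply skips any variable that is not set.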

In [None]:
# Print the instance metadata to confirm the modules are loaded.
meta_info = client.get_meta()
print(json.dumps(meta_info, indent=2))

In the cell below we set up the schema, which defines each class, its vectorizer, and the name and data type of each property. Note that it is essential to have your data cleaned and the categories clearly identified for this step. If you supply your own vectors, specify "none" for "vectorizer".

In [4]:
client.schema.delete_all()
schema = {
    "classes": [
        {
            "class": "Podcast",
            "vectorizer": "text2vec-cohere",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"]
                },
                {
                    "name": "transcript",
                    "dataType": ["text"]
                }
            ]
        }
    ]
}
client.schema.create(schema)

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"podcast_ORJj3GQyYa6n","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-10-03T07:59:40-07:00","took":118745}
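
To make the structure of a class definition explicit, here is a small sketch of a helper that builds one. `make_class` is a hypothetical name used only for illustration; it defaults to `"vectorizer": "none"`, the setting for supplying your own vectors.

```python
def make_class(name, text_properties, vectorizer="none"):
    """Build one class definition with text-typed properties."""
    return {
        "class": name,
        "vectorizer": vectorizer,
        "properties": [
            {"name": prop, "dataType": ["text"]} for prop in text_properties
        ],
    }

# A schema for the same Podcast class, but without a vectorizer module.
schema_own_vectors = {
    "classes": [make_class("Podcast", ["title", "transcript"])]
}
```

With "none", each object needs its vector passed explicitly at import time, for example through the `vector` argument of `client.batch.add_data_object`.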


In the following cells we load the locally stored data (in JSON format) and define an add_podcasts function that imports it in batches.

Inside the function, podcast_object is a dictionary holding the property values for one podcast, and the string "Podcast" is the target class the object is stored under. It must match a class defined in the schema. Here we use a single class, but the schema could define several, for example one per category of podcast (English, Spanish, etc.).

In [5]:
with open("/Users/tdubon/DEMO-semantic-search-podcast/data/podcast_ds.json", 'r') as f:
    datastore = json.load(f)
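
The import function later in this tutorial reads the title, transcript, and url fields of every record, so it can help to check the loaded data first. A minimal sketch, assuming datastore is a list of dictionaries; `find_incomplete_records` is a hypothetical helper, not part of Weaviate, and the sample records are made up for illustration:

```python
REQUIRED_FIELDS = ("title", "transcript", "url")

def find_incomplete_records(records, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) for each record lacking a required field."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent or empty.
        missing = [field for field in required if not record.get(field)]
        if missing:
            problems.append((index, missing))
    return problems

# Made-up sample data: the second record has an empty transcript and no url.
sample = [
    {"title": "Episode 1", "transcript": "Hello and welcome.", "url": "https://example.com/1"},
    {"title": "Episode 2", "transcript": ""},
]
print(find_incomplete_records(sample))
```

Running the same check on datastore before importing surfaces records that would otherwise raise a KeyError partway through the batch.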

In [6]:
from weaviate.util import generate_uuid5

def add_podcasts(batch_size=1):
    no_items_in_batch = 0
    for item in datastore:
        podcast_object = {
            "title": item["title"],
            "transcript": item["transcript"]
        }

        # weaviate.util.generate_uuid5 replaces helper.generate_uuid, which the
        # local helper module does not define. The same title and URL always
        # map to the same ID.
        podcast_uuid = generate_uuid5(item["title"] + item["url"], "podcast")
        client.batch.add_data_object(podcast_object, "Podcast", podcast_uuid)
        no_items_in_batch += 1

        if no_items_in_batch >= batch_size:
            results = client.batch.create_objects()

            # Log any non-empty result returned for an object in this batch.
            for result in results:
                if result['result'] != {}:
                    helper.log(result['result'])

            message = str(item["title"]) + ' imported'
            helper.log(message)

            no_items_in_batch = 0

    # Send any objects left over from a final partial batch.
    client.batch.create_objects()


In [8]:
add_podcasts(1)
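
The import relies on a deterministic ID per podcast, so importing the same item twice targets the same object. A minimal standard-library sketch of that idea follows. It is an assumption about what such a helper could look like, not the code of the helper module used here, and the example titles and URLs are made up. The Weaviate Python client also offers `weaviate.util.generate_uuid5` for the same purpose.

```python
import uuid

def deterministic_uuid(namespace, identifier):
    """Derive a stable version-5 UUID from a namespace label and an identifier."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{namespace}:{identifier}"))

# The same inputs always give the same ID; different inputs give different IDs.
first = deterministic_uuid("podcast", "Episode 1" + "https://example.com/1")
second = deterministic_uuid("podcast", "Episode 1" + "https://example.com/1")
other = deterministic_uuid("podcast", "Episode 2" + "https://example.com/2")
print(first == second, first == other)
```

Deriving the ID from the title and URL, rather than generating a random one, is what makes repeated imports idempotent.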

Next, implement whatever query pipeline you need, such as semantic search, generative search, or question answering. In this example we illustrate question answering, which requires a question-answering module (for example qna-openai) to be enabled on the instance.

In [None]:
# Question answering: search the transcript of Podcast objects to answer the question.
# The class and property names must match the schema defined earlier.
ask = {
  "question": "What is this podcast about?",
  "properties": ["transcript"]
}

res = (
  client.query
  .get("Podcast", [
      "title",
      "_additional {answer {hasAnswer property result} }"
  ])
  .with_ask(ask)
  .with_limit(1)
  .do()
)

print(json.dumps(res, indent=2))
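
The response is a nested dictionary, so a small helper can pull out just the answer. A sketch, assuming the usual GraphQL response shape (data, then Get, then the class name, then a list of objects) and a class named "Podcast" as in the schema; `extract_answer` is a hypothetical helper, and the example response is hand-written for illustration, not real output:

```python
def extract_answer(response, class_name="Podcast"):
    """Return the answer text from a question-answering response, or None."""
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    for obj in objects:
        answer = (obj.get("_additional") or {}).get("answer") or {}
        if answer.get("hasAnswer"):
            return answer.get("result")
    return None

# Hand-written stand-in for a response; the values are invented.
example_response = {
    "data": {
        "Get": {
            "Podcast": [
                {
                    "title": "Episode 1",
                    "_additional": {
                        "answer": {
                            "hasAnswer": True,
                            "property": "transcript",
                            "result": "vector databases",
                        }
                    },
                }
            ]
        }
    }
}
print(extract_answer(example_response))
```

Returning None when hasAnswer is false keeps the caller from mistaking an empty result for an answer.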