## Semantic Search using Weaviate and Docker 

In this tutorial we create a vector store that can be queried with semantic search, using a sample dataset of transcribed podcasts. The steps include loading your data from a local file, creating a schema, and importing the data as objects.


In your terminal:
1. Activate your virtual environment: conda activate /Users/your_path/environment_name OR source path_to_your_VR/bin/activate
2. Download the yml file in this repo (the Docker Compose configuration)
3. Run docker-compose up -d
4. Run pip install -r requirements.txt

In [11]:
import weaviate
from weaviate.util import generate_uuid5
import json




The local Docker container is set up with the text2vec-openai vectorizer module, which is specified in the yml file.
More on modules: https://weaviate.io/developers/weaviate/modules

In [None]:
#Instantiate the client with the REST API
client = weaviate.Client("http://localhost:8080")

In [None]:
#Print the client information to confirm the modules are loaded.
meta_info = client.get_meta()
print(json.dumps(meta_info, indent=2))
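Instead of reading the printed JSON by eye, the check can be done in code. The sketch below assumes the metadata dictionary has a top-level "modules" key whose keys are module names. The helper name module_is_loaded and the sample dictionary are hypothetical, and the sample values are made up rather than taken from a real server.

```python
def module_is_loaded(meta, module_name):
    """Return True if module_name appears under the "modules" key of the metadata."""
    modules = meta.get("modules") or {}
    return module_name in modules


# Hypothetical metadata with the assumed shape (all values are made up).
sample_meta = {
    "hostname": "http://[::]:8080",
    "modules": {"text2vec-openai": {"name": "OpenAI Module"}},
    "version": "0.0.0",
}

print(module_is_loaded(sample_meta, "text2vec-openai"))
print(module_is_loaded(sample_meta, "text2vec-cohere"))
```

In the notebook you would pass meta_info instead of sample_meta.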

In the cell below we set up the schema, an outline that lists the classes and, for each class, its vectorizer and its properties with their data types. The cell first deletes any existing Podcast class so that it can be re-run. Note that it is essential to have your data cleaned and the categories clearly identified for this step. If you are supplying your own vectors, "none" should be specified for "vectorizer".

In [None]:
client.schema.delete_class("Podcast")

schema = {
    "classes": [
        {
            "class": "Podcast",
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"]
                },
                {
                    "name": "transcript",
                    "dataType": ["text"]
                }
            ]
        }
    ]
}
client.schema.create(schema)
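If you supply your own vectors, the same schema can be written with "vectorizer" set to "none". The sketch below builds that dictionary with a hypothetical helper, build_class_schema, which is not part of the Weaviate client. It assumes every property is of type text, as in the schema above.

```python
import json


def build_class_schema(class_name, text_properties, vectorizer="none"):
    """Build a schema dict with one class whose properties are all of type text."""
    return {
        "classes": [
            {
                "class": class_name,
                "vectorizer": vectorizer,
                "properties": [
                    {"name": name, "dataType": ["text"]} for name in text_properties
                ],
            }
        ]
    }


# vectorizer="none" is the setting for supplying your own vectors.
schema_own_vectors = build_class_schema("Podcast", ["title", "transcript"])
print(json.dumps(schema_own_vectors, indent=2))
```

Passing vectorizer="text2vec-openai" reproduces the schema used in this tutorial.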

In the following cells we load the locally stored data (in JSON format) and define a function, add_podcasts, that imports it.

Each record is turned into a data object, a dictionary of property values (called properties in the function below). The target class is the class the object is stored under. Here it is the string "Podcast", but note that a schema can hold multiple classes, for example, if we had different categories of podcasts, such as English, Spanish, etc.

The import below runs with batch_size=1. Note that with larger amounts of data you will want to increase this setting. Per the documentation: "batch imports are used to maximize import speed and minimize network latency. Batch import processes multiple objects per request, and clients can parallelize the process."
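To see why the batch size matters, the sketch below splits a list of records into groups of at most batch_size items and counts the groups. It is a plain Python illustration, not how the Weaviate client batches internally. The chunk helper and the records are made up for the example.

```python
def chunk(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up records standing in for the podcast data.
records = [{"title": f"episode {n}"} for n in range(1, 8)]

for size in (1, 3):
    batches = list(chunk(records, size))
    print(f"batch_size={size}: {len(batches)} requests for {len(records)} records")
```

With one object per request, the number of requests equals the number of records, which is why a larger batch size helps on bigger datasets.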

In [None]:
with open("/Users/tdubon/DEMO-semantic-search-podcast/data/podcast_ds.json", 'r') as f:
    datastore = json.load(f)

In [None]:
print(json.dumps(datastore, indent=2))
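Before importing, it can help to confirm that every record has the fields the schema expects. The sketch below assumes the data is a list of dictionaries with "title" and "transcript" keys, as the import function in this tutorial reads them. The helper find_invalid_records and the sample records are hypothetical.

```python
def find_invalid_records(records, required_keys=("title", "transcript")):
    """Return (index, reason) pairs for records lacking a non-empty string value."""
    problems = []
    for index, record in enumerate(records):
        for key in required_keys:
            value = record.get(key) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{key}'"))
    return problems


# Made-up records: the second has an empty title, the third has no transcript.
sample_records = [
    {"title": "Episode A", "transcript": "Some transcript text."},
    {"title": "", "transcript": "Transcript without a title."},
    {"title": "Episode C"},
]

print(find_invalid_records(sample_records))
```

In the notebook you would call find_invalid_records(datastore) and fix or drop the reported records before importing.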

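In the cell below we define the batch import and the UUID for each object.

Batching is helpful because it's "a way of importing/creating objects and references in bulk using a single API request to the Weaviate server." The UUID is generated from the title and transcript, so the same podcast always gets the same ID.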

In [None]:
def add_podcasts(batch_size = 50):
    #Use the batch size passed by the caller
    client.batch.configure(batch_size=batch_size)
    with client.batch as batch:
        for i, d in enumerate(datastore):
            print(f"importing podcast: {i+1}")
            properties = {
                "title": d["title"],
                "transcript": d["transcript"]
            }
            #Deterministic UUID: identifier first, namespace second
            podcast_uuid = generate_uuid5(d["title"] + d["transcript"], 'podcast')

            batch.add_data_object(
                data_object=properties,
                class_name="Podcast",
                uuid=podcast_uuid)
    

In [None]:
add_podcasts(1)
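The point of deriving the UUID from the content is that the same content always maps to the same ID. The sketch below illustrates the idea with the standard library's uuid.uuid5 as a stand-in for generate_uuid5. It is an assumption-laden illustration and is not claimed to produce the same IDs as the Weaviate helper. The function content_uuid and the sample values are hypothetical.

```python
import uuid


def content_uuid(title, transcript, namespace="podcast"):
    """Derive a version 5 UUID from the record content."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, namespace + title + transcript))


first = content_uuid("Episode A", "Some transcript text.")
again = content_uuid("Episode A", "Some transcript text.")
other = content_uuid("Episode B", "Some transcript text.")

print(first == again)  # same content, same ID
print(first == other)  # different content, different ID
```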

Next you query your data. Weaviate supports several kinds of search, such as semantic search, generative search, and question answering. In this example we use nearText with the text2vec-openai module, which uses text-embedding-ada-002.

In [None]:
#Semantic search with nearText

response = (
    client.query
    .get("Podcast", ["transcript"])
    .with_near_text({"concepts": ["semantic search"]})
    .with_limit(3)
    .with_additional(["distance"])
    .do()
)

print(json.dumps(response, indent=2))
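The raw JSON can be long when transcripts are large, so a compact view of the hits is often easier to read. The sketch below assumes the response nests results under "data", then "Get", then the class name, with the distance under "_additional". The helper extract_hits is hypothetical, and the sample response is made up rather than returned by a real query.

```python
def extract_hits(response, class_name="Podcast", preview_chars=60):
    """Return (distance, transcript preview) pairs from a Get query response."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(class_name) or []
    hits = []
    for obj in objects:
        distance = (obj.get("_additional") or {}).get("distance")
        preview = (obj.get("transcript") or "")[:preview_chars]
        hits.append((distance, preview))
    return hits


# Made-up response with the assumed shape.
sample_response = {
    "data": {
        "Get": {
            "Podcast": [
                {"_additional": {"distance": 0.15}, "transcript": "A talk about vector search."},
                {"_additional": {"distance": 0.21}, "transcript": "An interview on embeddings."},
            ]
        }
    }
}

for distance, preview in extract_hits(sample_response):
    print(f"{distance:.2f}  {preview}")
```

A smaller distance indicates a closer match to the query concepts. If the response carries no results, the helper returns an empty list.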