# Sparse, Dense, and Hybrid Search

This notebook walks through the following steps:

- As in the previous lesson, downloads the Jeopardy questions (in JSON format) with `requests` and parses them into a Python list.
- Starts a Weaviate client that runs the database locally and configures it to use OpenAI's API for text vectorization.
- Deletes the existing "Question" class from the schema if it already exists, so the notebook can be re-run cleanly.
- Creates the schema for the "Question" class, using the `text2vec-openai` module with the "ada" model to vectorize text.
- Imports the dataset in batches, adding each item's answer, question, and category as a data object of the "Question" class.
- Runs a **dense search**: the vector embedding of the concept "animal" is used to find semantically similar questions.
- Runs a **sparse search**: BM25 keyword-based retrieval of questions containing "animal".
- Runs a **hybrid search**, which combines vector and keyword search and uses an `alpha` parameter to balance dense and sparse results.
- Tests `alpha` at 0 (pure sparse), 0.5 (balanced), and 1 (pure dense).
- Prints the results of each query in JSON format.

In short, **dense search** focuses on semantic similarity and context by using vector embeddings. **Sparse search** focuses on keyword matching using traditional term frequency and inverse document frequency statistics, as in BM25.

## Remove old Weaviate DB files

In [1]:
!rm -rf ~/.local/share/weaviate

## Recreate the example
This uses the same data as in the previous lesson.

In [None]:
import requests
import json

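# Download the data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
resp.raise_for_status()  # Fail early if the download did not succeed
data = json.loads(resp.text)  # Parse the JSON into a Python list

# Preview the type and size of the parsed data
print(type(data), len(data))

def json_print(data):
    """Pretty-print a JSON-serializable object."""
    print(json.dumps(data, indent=2))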

In [None]:
import weaviate, os
from weaviate import EmbeddedOptions
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-BaseURL": os.environ['OPENAI_API_BASE'],
        "X-OpenAI-Api-Key": openai.api_key,  # Key read from the environment above
    }
)
print(f"Client created? {client.is_ready()}")

In [None]:
# Delete the "Question" class if it already exists, so this cell can be re-run.
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # Use OpenAI as the vectorizer
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "baseURL": os.environ["OPENAI_API_BASE"]
        }
    }
}

client.schema.create_class(class_obj)

In [None]:
with client.batch.configure(batch_size=5) as batch:
    for i, d in enumerate(data):  # Batch import data

        print(f"importing question: {i+1}")

        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )

## Queries

### Dense Search

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animal"]})
    .with_limit(3)
    .do()
)

json_print(response)
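
Dense search ranks objects by how close their embedding vectors are to the query vector. The sketch below illustrates the idea with cosine similarity on made-up three-dimensional vectors. The `cosine_similarity` helper, the vectors, and the labels are all hypothetical; real "ada" embeddings have far more dimensions, and Weaviate computes the distances itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not real OpenAI vectors)
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "mammal question": [0.8, 0.2, 0.1],
    "physics question": [0.0, 0.3, 0.9],
}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

The document whose vector points in nearly the same direction as the query ranks first, even though no keywords are compared at any point.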

### Sparse Search - BM25

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_bm25(query="animal")
    .with_limit(3)
    .do()
)

json_print(response)
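
To make the keyword scoring more concrete, here is a minimal sketch of a textbook BM25 formula in plain Python. It is not Weaviate's implementation: the `bm25_scores` helper, the whitespace tokenization, the sample sentences, and the `k1` and `b` values (commonly used defaults) are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    query_terms = query.lower().split()

    # Document frequency: number of documents containing each query term
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in query_terms}

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for t in query_terms:
            tf = counts[t]
            if tf == 0:
                continue  # term absent: contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical sample sentences
docs = [
    "the heaviest land animal is the elephant",
    "this metal is liquid at room temperature",
    "an animal that is an animal of the sea",
]
scores = bm25_scores("animal", docs)
print(scores)
```

A document that never mentions the query term scores zero, however related its meaning might be. This is the main contrast with dense search.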

### Hybrid Search

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_hybrid(query="animal", alpha=0.5)  # alpha=0.5: equal weight to dense and sparse
    .with_limit(3)
    .do()
)

json_print(response)

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_hybrid(query="animal", alpha=0)  # alpha=0: pure sparse (keyword) search
    .with_limit(3)
    .do()
)

json_print(response)

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_hybrid(query="animal", alpha=1)  # alpha=1: pure dense (vector) search
    .with_limit(3)
    .do()
)

json_print(response)
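
The effect of the three `alpha` settings can be illustrated with a toy weighted blend of two scores. This is a simplified sketch, not Weaviate's actual fusion algorithm; the `hybrid_score` helper and the candidate scores are hypothetical, and the scores are assumed to be already normalized to the range 0 to 1.

```python
def hybrid_score(dense, sparse, alpha):
    """Blend two normalized scores; alpha is the weight given to the dense score."""
    return alpha * dense + (1 - alpha) * sparse

# Hypothetical normalized scores for two results
candidates = {
    "semantic match": {"dense": 0.9, "sparse": 0.1},
    "keyword match": {"dense": 0.2, "sparse": 1.0},
}

for alpha in (0, 0.5, 1):
    ranked = sorted(
        candidates,
        key=lambda name: hybrid_score(
            candidates[name]["dense"], candidates[name]["sparse"], alpha
        ),
        reverse=True,
    )
    print(f"alpha={alpha}: {ranked}")
```

With these made-up numbers, the keyword match ranks first at `alpha=0` and `alpha=0.5`, and the semantic match takes over at `alpha=1`.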