<a href="https://colab.research.google.com/github/erika-cardenas/recipes/blob/main/similarity_search_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dependencies

In [None]:
!pip install weaviate-client

## Configuration

In [None]:
import weaviate
import json

client = weaviate.Client(
  url="YOUR-INSTANCE-URL",  # URL of your Weaviate instance
  auth_client_secret=weaviate.AuthApiKey(api_key="AUTH-KEY"), # (Optional) If the Weaviate instance requires authentication
  additional_headers={
    "X-huggingface-Api-Key": "hf_KEY", # Replace with your Hugging Face key
  }
)

client.schema.get()  # Get the schema to test connection

For more information on deploying your model as an endpoint with Hugging Face Inference Endpoints, see the [documentation](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-huggingface#additional-information).
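
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the connection settings could be read from environment variables instead. The variable names `WEAVIATE_URL`, `WEAVIATE_API_KEY`, and `HUGGINGFACE_API_KEY` are assumptions made for this example, not names that Weaviate looks for.

```python
import os

def connection_settings(env=os.environ):
    """Collect Weaviate connection settings from environment variables.

    The variable names are hypothetical; rename them to match your setup.
    """
    settings = {
        "url": env["WEAVIATE_URL"],
        "additional_headers": {
            # Same header name as in the configuration cell above
            "X-huggingface-Api-Key": env["HUGGINGFACE_API_KEY"],
        },
    }
    # The Weaviate API key is optional, so it is None when the variable is unset
    api_key = env.get("WEAVIATE_API_KEY")
    return settings, api_key

# Demonstration with a plain dictionary standing in for os.environ
fake_env = {"WEAVIATE_URL": "http://localhost:8080", "HUGGINGFACE_API_KEY": "hf_KEY"}
settings, api_key = connection_settings(fake_env)
print(settings["url"], api_key)
```

The returned dictionary can be unpacked into `weaviate.Client(**settings)`, with `auth_client_secret=weaviate.AuthApiKey(api_key=api_key)` added only when a key is present.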

## Schema

In [None]:
# Reset the schema. CAUTION: THIS DELETES ALL CLASSES AND THEIR DATA
client.schema.delete_all()

schema = {
   "classes": [
       {
           "class": "JeopardyQuestion",
           "description": "List of jeopardy questions",
           "vectorizer": "text2vec-huggingface",
           "moduleConfig": { # specify the vectorizer and model type you're using
              "text2vec-huggingface": {
                  "model": "sentence-transformers/all-MiniLM-L6-v2",
                  "options": {
                      # booleans rather than the strings "true"
                      "waitForModel": True,
                      "useGPU": True,
                      "useCache": True
                    }
                }
           },
           "properties": [
               {
                   "name": "category",
                   "dataType": ["text"],
                   "description": "Category of the question",
               },
               {
                   "name": "question",
                   "dataType": ["text"],
                   "description": "The question",
               },
               {
                   "name": "answer",
                   "dataType": ["text"],
                   "description": "The answer",
               }
            ]
        }
    ]
}

client.schema.create(schema)

print("Successfully created the schema.")
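
`client.schema.delete_all()` removes every class in the instance, not just `JeopardyQuestion`. A narrower approach is to check whether the class is already present before touching it. The sketch below assumes that `client.schema.get()` returns a dictionary with a `"classes"` list, in the same shape as the schema definition above; `class_exists` is a helper name made up for this example.

```python
def class_exists(schema, class_name):
    """Return True if `class_name` appears in a schema dictionary."""
    classes = schema.get("classes") or []
    return any(c.get("class") == class_name for c in classes)

# Illustrative stand-in for the dictionary returned by client.schema.get()
existing = {"classes": [{"class": "JeopardyQuestion", "properties": []}]}

print(class_exists(existing, "JeopardyQuestion"))  # True
print(class_exists(existing, "Article"))           # False
```

With a live client, `class_exists(client.schema.get(), "JeopardyQuestion")` could then guard a call to `client.schema.delete_class("JeopardyQuestion")`, leaving other classes untouched.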

## Import the Data

In [None]:
import requests
url = 'https://raw.githubusercontent.com/weaviate/weaviate-examples/main/jeopardy_small_dataset/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

if client.is_ready():
    # Configure a batch process
    with client.batch as batch:
        batch.batch_size = 100
        # Batch import all questions
        for i, d in enumerate(data):
            print(f"importing question: {i+1}")

            properties = {
                "answer": d["Answer"],
                "question": d["Question"],
                "category": d["Category"],
            }

            # Use the batch object provided by the context manager
            batch.add_data_object(properties, "JeopardyQuestion")
else:
    print("The Weaviate cluster is not connected.")
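
Because the import cell does not pass an ID, running it twice stores every question twice under new random IDs. One possible way to make the import repeatable is to derive a deterministic UUID from the content and pass it along, assuming your client version accepts a `uuid` argument on `add_data_object`. The helpers `to_properties` and `stable_id` are names invented for this sketch, and it uses only the standard library.

```python
import uuid

def to_properties(record):
    """Map one dataset record onto the lowercase property names used at import."""
    return {
        "answer": record["Answer"],
        "question": record["Question"],
        "category": record["Category"],
    }

def stable_id(properties):
    """Derive a deterministic UUID from the question and answer text."""
    key = properties["question"] + "|" + properties["answer"]
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

# Made-up record with the same keys as the dataset entries
sample = {"Category": "SCIENCE", "Question": "Example question", "Answer": "Example answer"}
props = to_properties(sample)
print(props)
print(stable_id(props) == stable_id(to_properties(sample)))  # True: same input, same ID
```

Inside the batch loop this would look like `batch.add_data_object(props, "JeopardyQuestion", uuid=stable_id(props))`.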

## Query Weaviate: Similarity Search (Text objects)

Similarity search options for text objects in **Weaviate**:

1. [nearText](https://weaviate.io/developers/weaviate/api/graphql/vector-search-parameters#neartext)

2. [nearObject](https://weaviate.io/developers/weaviate/api/graphql/vector-search-parameters#nearobject)

3. [nearVector](https://weaviate.io/developers/weaviate/api/graphql/vector-search-parameters#nearvector)

### nearText Example

Find the `JeopardyQuestion` objects closest to the concept "question about animals". Limit the output to 2 results and report each object's distance and id.

In [None]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_near_text({
        "concepts": ["question about animals"]
    })
    .with_limit(2) # limit the output to only 2
    .with_additional(["distance", "id"]) # output the distance from the query vector to the retrieved objects, along with each object's id
    .do()
)

print(json.dumps(response, indent=2))
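
The raw response is a nested dictionary, which gets hard to read once more properties are requested. As a rough sketch, the hits can be flattened into tuples for display. The sample response below is hand-written to mirror the `data` → `Get` → class name nesting of the query above; its values are placeholders, not real output.

```python
def extract_hits(response, class_name):
    """Flatten a Get response into (distance, question, answer) tuples."""
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")
    hits = response["data"]["Get"][class_name]
    return [
        (hit["_additional"]["distance"], hit["question"], hit["answer"])
        for hit in hits
    ]

# Hand-written stand-in for a real response; the values are placeholders
sample_response = {
    "data": {
        "Get": {
            "JeopardyQuestion": [
                {
                    "_additional": {"distance": 0.25, "id": "placeholder-id-1"},
                    "question": "Placeholder question one",
                    "answer": "Placeholder answer one",
                },
                {
                    "_additional": {"distance": 0.5, "id": "placeholder-id-2"},
                    "question": "Placeholder question two",
                    "answer": "Placeholder answer two",
                },
            ]
        }
    }
}

for distance, question, answer in extract_hits(sample_response, "JeopardyQuestion"):
    print(f"{distance:.3f}  {question} -> {answer}")
```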

### nearObject Example

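Search through the `JeopardyQuestion` class to find the top 2 objects closest to the object with id `5e99ed1d-aef8-41b2-a55b-105810e41560`. The id was taken from the output of the nearText query above; ids are generated at import time, so replace it with one from your own instance.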

In [None]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_near_object({
        "id": "5e99ed1d-aef8-41b2-a55b-105810e41560"
    })
    .with_limit(2) # limit the output to only 2
    .with_additional(["distance"]) # output the distance from the query vector to the retrieved objects
    .do()
)

print(json.dumps(response, indent=2))
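
The object used as the query is itself stored in the class, so it can come back as the closest hit. If only the neighbours are of interest, the list of hits can be filtered afterwards. This sketch assumes the query also requested `"id"` in `with_additional`, which the cell above does not do, and the second id in the sample data is a placeholder.

```python
def without_query_object(hits, query_id):
    """Drop the hit whose id matches the object used as the query."""
    return [hit for hit in hits if hit["_additional"]["id"] != query_id]

query_id = "5e99ed1d-aef8-41b2-a55b-105810e41560"

# Placeholder hits in the shape of response["data"]["Get"]["JeopardyQuestion"]
sample_hits = [
    {"question": "Placeholder question one", "_additional": {"id": query_id, "distance": 0.0}},
    {"question": "Placeholder question two",
     "_additional": {"id": "00000000-0000-0000-0000-000000000000", "distance": 0.4}},
]

neighbours = without_query_object(sample_hits, query_id)
print(len(neighbours))
```

When filtering this way, consider raising the limit by one so that two neighbours remain after the query object is removed.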

### nearVector Example

Search through the `JeopardyQuestion` class to find the top 2 objects closest to the query vector `[-0.0125526935, -0.021168863, ... ]`

In [None]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_near_vector({
        "vector": [-0.0125526935, -0.021168863, ... ] # truncated placeholder: replace with your full query vector
    })
    .with_limit(2) # limit the output to only 2
    .with_additional(["distance"]) # output the distance from the query vector to the retrieved objects
    .do()
)

print(json.dumps(response, indent=2))
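
To make the reported `distance` easier to interpret, here is a small sketch of how cosine distance is computed. It assumes the class uses cosine distance, which is Weaviate's default when the schema does not configure another metric. A smaller value means the vectors point in more similar directions. The function is written for this example with the standard library only and is not part of the Weaviate client.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

The length check matters for `nearVector` as well: the query vector needs the same number of dimensions as the stored vectors, which is 384 for `sentence-transformers/all-MiniLM-L6-v2`.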