[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/model-providers/huggingface/similarity_search_all_MiniLM_L6_v2.ipynb)

## Similarity search with all-MiniLM-L6-v2 via Hugging Face and Weaviate

## Dependencies

In [1]:
%%capture
%pip install -q -U weaviate-client

## Connect to Weaviate

Now, you will need to connect to a running Weaviate vector database cluster.

You can choose one of the following options:

1. **Option 1:** You can create a 14-day free sandbox on the managed service [Weaviate Cloud (WCD)](https://console.weaviate.cloud/)
2. **Option 2:** [Embedded Weaviate](https://docs.weaviate.io/deploy/installation-guides/embedded)
3. **Option 3:** [Local deployment](https://docs.weaviate.io/deploy/installation-guides/docker-installation)
4. [Other options](https://docs.weaviate.io/deploy)

In [2]:
import os

huggingface_key = os.environ["HF_TOKEN"] # Replace with your Hugging Face API token
WCD_URL = os.environ["WEAVIATE_URL"] # Replace with your Weaviate cluster URL
WCD_AUTH_KEY = os.environ["WEAVIATE_API_KEY"] # Replace with your cluster auth key

# Uncomment if you are working in a Google Colab environment
#from google.colab import userdata

#huggingface_key = userdata.get('HF_TOKEN')
#WCD_URL = userdata.get("WEAVIATE_URL")
#WCD_AUTH_KEY = userdata.get("WEAVIATE_API_KEY")

headers={
    "X-huggingface-Api-Key": huggingface_key
  }


In [None]:
import weaviate

# Option 1: Weaviate Cloud deployment
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WCD_URL,
    auth_credentials=weaviate.auth.AuthApiKey(WCD_AUTH_KEY),
    headers=headers,
)

# Option 2: Embedded Weaviate instance
# use if you want to explore Weaviate without any additional setup
#client = weaviate.connect_to_embedded(
#  headers=headers
#)

# Option 3: Locally hosted instance of Weaviate via Docker or Kubernetes
#!docker run --detach -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.29.0
#client = weaviate.connect_to_local(
#  headers=headers
#)

print(client.is_ready())

In [4]:
# Check the cluster metadata to verify that the text2vec-huggingface module is enabled
client.get_meta()['modules']

{'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
  'name': 'Generative Search - OpenAI'},
 'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
  'name': 'OpenAI Question & Answering Module'},
 'ref2vec-centroid': {},
 'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
  'name': 'Reranker - Cohere'},
 'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
  'name': 'Cohere Module'},
 'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
  'name': 'Hugging Face Module'},
 'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
  'name': 'OpenAI Module'}}
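Rather than reading the module list by eye, the check can be scripted. The following is a minimal sketch: `missing_modules` is a hypothetical helper (not part of the Weaviate client), and the `modules` dict is a trimmed stand-in for the value returned by `client.get_meta()['modules']` above.

```python
def missing_modules(meta_modules, required):
    """Return the required module names that are absent from the metadata dict."""
    return [name for name in required if name not in meta_modules]


# Stand-in for client.get_meta()['modules']; keys copied from the output above.
modules = {
    "text2vec-huggingface": {"name": "Hugging Face Module"},
    "text2vec-openai": {"name": "OpenAI Module"},
}

missing = missing_modules(modules, ["text2vec-huggingface"])
if missing:
    print("Missing modules:", missing)
else:
    print("All required modules are enabled.")
```

In the notebook itself you would pass `client.get_meta()['modules']` instead of the stand-in dict.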

## Create a collection
> A collection stores your data and its vector embeddings.

In [5]:
# Note: in practice, avoid rerunning this cell. It deletes all data
# in "MyCollection", so you would have to import it again.
import weaviate.classes.config as wc

# Delete the collection if it already exists
if (client.collections.exists("MyCollection")):
    client.collections.delete("MyCollection")

# Create a collection
collection = client.collections.create(
    "MyCollection",
    vector_config=wc.Configure.Vectors.text2vec_huggingface( # specify the vectorizer and model type you're using
        model="sentence-transformers/all-MiniLM-L6-v2",
        wait_for_model=True,
        use_gpu=True,
        use_cache=True,
        # all-MiniLM-L6-v2 returns 384-dimensional embeddings
    ),
    properties=[ # defining properties (data schema) is optional
        wc.Property(name="Text", data_type=wc.DataType.TEXT),
    ]
)

print("Successfully created collection: MyCollection.")

Successfully created collection: MyCollection.


## Import the Data

In [6]:
# Set up our multilingual data (English/German sentence pairs)
multilingual_data = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Der schnelle braune Fuchs springt über den faulen Hund."},

    {"text": "Artificial intelligence is transforming industries."},
    {"text": "Künstliche Intelligenz verändert Industrien."},

    {"text": "I love to read books in my free time."},
    {"text": "Ich lese gerne Bücher in meiner Freizeit."},

    {"text": "Beautiful weather today, perfect for a walk."},
    {"text": "Schönes Wetter heute, perfekt für einen Spaziergang."},

    {"text": "The capital of Germany is Berlin."},
    {"text": "Die Hauptstadt Deutschlands ist Berlin."}

]


# Insert data objects
response = collection.data.insert_many(multilingual_data)

# Note: the `multilingual_data` list contains only 10 objects, so a single insert_many call is fine.
# If you have a million objects to insert, split them into smaller batches (e.g. 100-1000 per insert).

if (response.has_errors):
    print(response.errors)
else:
    print("Insert complete.")

Insert complete.
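The batching advice in the cell above can be sketched as follows. This is a minimal illustration under stated assumptions: `chunked` is a hypothetical helper, `rows` is placeholder data, and the batch size of 1000 is taken from the upper end of the range suggested in the comment. The actual `insert_many` call is left as a comment so the snippet runs without a Weaviate connection.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder data standing in for a larger dataset.
rows = [{"text": f"Document {i}"} for i in range(2500)]

for batch in chunked(rows, 1000):
    # In the notebook, each batch would be sent with:
    #   response = collection.data.insert_many(batch)
    # followed by the same response.has_errors check as above.
    print(len(batch))
```

With 2500 placeholder rows this prints 1000, 1000, and 500, one line per batch.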


Quick check that all 10 objects were imported.

In [7]:
len(collection)

10

## Query Weaviate: Similarity Search (Text objects)

Similarity search options for text objects in **Weaviate**:

1. [near_text](https://weaviate.io/developers/weaviate/search/similarity#an-input-medium)

2. [near_object](https://weaviate.io/developers/weaviate/search/similarity#an-object)

3. [near_vector](https://weaviate.io/developers/weaviate/search/similarity#a-vector)

### nearText Example

Find the objects in `MyCollection` closest to the query "What's the capital of Germany?". Limit the results to 3 objects.

In [9]:
import weaviate.classes.query as wq
import json

# Note: you can reuse the collection object from the previous cell,
# or get a collection object for "MyCollection" by name.
collection = client.collections.get("MyCollection")

query = "What's the capital of Germany?"

response = collection.query.near_text(
    query=query,
    include_vector=True,                             # return the vector embeddings
    return_metadata=wq.MetadataQuery(distance=True), # return the distance
    limit=3
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2))
    print("Vector:", item.vector['default'][:3], '...')
    print("Distance:", item.metadata.distance, "\n")

ID: adcdd665-87ad-45f1-85c2-dd7d6beac702
Data: {
  "text": "The capital of Germany is Berlin."
}
Vector: [0.04136650264263153, 0.08663720637559891, -0.007826615124940872] ...
Distance: 0.39156365394592285 

ID: 58a67982-224c-46eb-8f34-68c73d33d65d
Data: {
  "text": "Die Hauptstadt Deutschlands ist Berlin."
}
Vector: [-0.018270188942551613, 0.10900147259235382, -0.06833386421203613] ...
Distance: 0.5286470651626587 

ID: 74a85b70-f6ca-457f-b40d-29206f852663
Data: {
  "text": "Der schnelle braune Fuchs springt \u00fcber den faulen Hund."
}
Vector: [-0.0851321667432785, 0.06763748824596405, -0.02962086722254753] ...
Distance: 0.7293647527694702 
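To make the `Distance` values above easier to interpret, here is a minimal sketch of how a cosine distance is computed. It assumes the collection uses Weaviate's default cosine metric (the notebook does not set one explicitly), where distance is 1 minus cosine similarity: 0 means the vectors point in the same direction and larger values mean less similar. `cosine_distance` is a hypothetical helper for illustration, not a Weaviate API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical direction: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 1.0
```

Under that assumption, the English sentence (distance about 0.39) is closer to the English query than its German translation (about 0.53).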



In [None]:
# Save last item's vector embedding and UUID for near_object and near_vector search
vector = item.vector['default']
uuid = item.uuid

### nearObject Example

Search through the `MyCollection` collection to find the top 4 objects closest to the object with the given ID (the ID was taken from the query above).

In [None]:
print(uuid)

response = collection.query.near_object(
    near_object=uuid, # replace with your id of interest
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")

### nearVector Example

Search through the `MyCollection` collection to find the top 4 objects closest to the query vector.

In [None]:
print(vector[:5])

response = collection.query.near_vector(
    near_vector=vector, # your vector object goes here
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")