[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/model-providers/huggingface/similarity_search_mmBERT.ipynb)

## Similarity search with Multilingual ModernBERT (mmBERT) via Hugging Face and Weaviate

- Read more on the [blog](https://huggingface.co/blog/mmbert)
- Check out the model: [`mmBERT-small`](https://huggingface.co/jhu-clsp/mmBERT-small)

## Dependencies

In [9]:
%%capture
%pip install -q -U weaviate-client

## Connect to Weaviate

This notebook uses the [`text2vec-transformers`](https://docs.weaviate.io/weaviate/model-providers/transformers/embeddings) module, which is **only** available in self-hosted, open-source Weaviate deployed via [Docker](https://docs.weaviate.io/deploy/installation-guides/docker-installation) or [Kubernetes](https://docs.weaviate.io/deploy/installation-guides/k8s-installation). The integration is **not** available on Weaviate Cloud (WCD) serverless instances, because it requires running a separate container with the Hugging Face model.

This notebook uses a [pre-built transformers model container](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers#pre-built-images). For this, create a `docker-compose.yml` file with the following contents:
```
---
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.32.2
    ports:
    - 8080:8080
    - 50051:50051
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'weaviate-0'
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:jhu-clsp-mmBERT-small
    environment:
      ENABLE_CUDA: '0'
...
```
Then start the Docker containers with:
```
docker-compose up -d
```
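
The containers may need some time before they accept requests, since the inference container has to load the model first. Below is a minimal sketch of a polling helper, assuming Weaviate is exposed on `localhost:8080` as in the compose file above and using Weaviate's `/v1/.well-known/ready` readiness endpoint. The names `http_ready` and `wait_until_ready`, and the timeout values, are hypothetical and not part of the Weaviate client.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8080/v1/.well-known/ready"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx response: not ready yet
        return False


def wait_until_ready(check, timeout=120.0, interval=2.0, sleep=time.sleep):
    """Call `check()` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready(http_ready)` before connecting returns `True` once the endpoint responds, or `False` if the timeout expires first.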

In [11]:
import weaviate

# Connect to your local Weaviate instance deployed with Docker
client = weaviate.connect_to_local()

print(client.is_ready())

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.30.5/weaviate-v1.30.5-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 1593


True


In [12]:
# Check the cluster metadata to verify if the module is enabled
client.get_meta()['modules']

{'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
  'name': 'Generative Search - OpenAI'},
 'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
  'name': 'OpenAI Question & Answering Module'},
 'ref2vec-centroid': {},
 'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
  'name': 'Reranker - Cohere'},
 'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
  'name': 'Cohere Module'},
 'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
  'name': 'Hugging Face Module'},
 'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
  'name': 'OpenAI Module'}}
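
Rather than scanning the printed dictionary by eye, you can check for the module programmatically. This is a minimal sketch: `missing_modules` is a hypothetical helper, and `example_modules` is a made-up stand-in for the value of `client.get_meta()['modules']`.

```python
def missing_modules(modules, required):
    """Return the required module names that are absent from the `modules` mapping."""
    return [name for name in required if name not in modules]


# Illustrative stand-in for client.get_meta()['modules']; the values are made up.
example_modules = {
    "text2vec-transformers": {"model": {}},
    "ref2vec-centroid": {},
}

missing = missing_modules(example_modules, ["text2vec-transformers"])
print("Missing modules:", missing or "none")
```

With a live connection, pass `client.get_meta()['modules']` instead of `example_modules`. A non-empty result suggests the client is connected to an instance that was not started from the compose file above.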

## Create a collection
> A collection stores your data and vector embeddings.

In [13]:
# Note: in practice, avoid rerunning this cell. It deletes the data
# in "MyCollection", which you then have to re-import.
import weaviate.classes.config as wc

# Delete the collection if it already exists
if (client.collections.exists("MyCollection")):
    client.collections.delete("MyCollection")

# Create a collection
collection = client.collections.create(
    "MyCollection",
    vector_config=wc.Configure.Vectors.text2vec_transformers(
        dimensions=512, # default: 768, possible options: 128, 256, 512
    ),
    properties=[ # defining properties (data schema) is optional
        wc.Property(name="text", data_type=wc.DataType.TEXT),
    ]
)

print("Successfully created collection: MyCollection.")

Successfully created collection: MyCollection.


## Import the Data

In [14]:
# Set up our multi-lingual data
multilingual_data = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Der schnelle braune Fuchs springt über den faulen Hund."},

    {"text": "Artificial intelligence is transforming industries."},
    {"text": "Künstliche Intelligenz verändert Industrien."},

    {"text": "I love to read books in my free time."},
    {"text": "Ich lese gerne Bücher in meiner Freizeit."},

    {"text": "Beautiful weather today, perfect for a walk."},
    {"text": "Schönes Wetter heute, perfekt für einen Spaziergang."},

    {"text": "The capital of Germany is Berlin."},
    {"text": "Die Hauptstadt Deutschlands ist Berlin."}

]


# Insert data objects
response = collection.data.insert_many(multilingual_data)

# Note: the `multilingual_data` list contains only 10 objects, which is fine for a single insert_many call.
# If you have a million objects to insert, split them into smaller batches (e.g. 100-1000 per insert).

if (response.has_errors):
    print(response.errors)
else:
    print("Insert complete.")

WeaviateInsertManyAllFailedError: Every object failed during insertion. Here is the set of all errors: unmarshal error response body: Not Found
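
For larger datasets, the batching advice in the cell above can be sketched as follows. `chunked` and `insert_in_batches` are hypothetical helpers, and the default batch size of 200 is an arbitrary choice within the suggested 100-1000 range. The insert function is passed in as an argument, so the sketch runs without a Weaviate connection.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(insert_fn, objects, batch_size=200):
    """Call `insert_fn` once per chunk and return the list of responses."""
    return [insert_fn(chunk) for chunk in chunked(objects, batch_size)]
```

In this notebook the call would look like `insert_in_batches(collection.data.insert_many, multilingual_data)`, followed by a check of `has_errors` on each returned response. The Python client also provides a batching context manager, `collection.batch.dynamic()`, which may be more convenient for very large imports.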

Quick check to see if all objects are in.

In [None]:
len(collection)

## Query Weaviate: Similarity Search (Text objects)

Similarity search options for text objects in **Weaviate**:

1. [near_text](https://weaviate.io/developers/weaviate/search/similarity#an-input-medium)

2. [near_object](https://weaviate.io/developers/weaviate/search/similarity#an-object)

3. [near_vector](https://weaviate.io/developers/weaviate/search/similarity#a-vector)

### nearText Example

Find the objects in `MyCollection` closest to the query "What's the capital of Germany?". Limit the results to 3 objects.

In [None]:
import weaviate.classes.query as wq
import json

# Note: you can reuse the collection object from the previous cell.
# Get a collection object for "MyCollection"
collection = client.collections.get("MyCollection")

query = "What's the capital of Germany?"

response = collection.query.near_text(
    query=query,
    include_vector=True,                             # return the vector embeddings
    return_metadata=wq.MetadataQuery(distance=True), # return the distance
    limit=3
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2))
    print("Vector:", item.vector['default'][:3], '...')
    print("Distance:", item.metadata.distance, "\n")
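
How to read the `distance` value depends on the collection's distance metric. Assuming the collection uses Weaviate's default metric, cosine distance, the value equals 1 minus the cosine similarity of the two vectors: 0 means they point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions. A minimal pure-Python sketch of that computation (`cosine_distance` is a hypothetical helper, not a client function):

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite: 2.0
```

Smaller distances therefore indicate closer matches, which is why results are returned in ascending order of distance.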

In [None]:
# Save last item's vector embedding and UUID for near_object and near_vector search
vector = item.vector['default']
uuid = item.uuid

### nearObject Example

Search `MyCollection` for the top 4 objects closest to the object with the given ID (the ID was taken from the query above).

In [None]:
print(uuid)

response = collection.query.near_object(
    near_object=uuid, # replace with your id of interest
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")

### nearVector Example

Search `MyCollection` for the top 4 objects closest to the query vector.

In [None]:
print(vector[:5])

response = collection.query.near_vector(
    near_vector=vector, # your vector object goes here
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")
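
Conceptually, `near_vector` returns the stored objects whose vectors lie closest to the query vector. Weaviate answers such queries with an approximate index (HNSW by default), but an exact brute-force ranking is a useful mental model. The sketch below assumes cosine distance and uses toy 2-D vectors with made-up IDs; `nearest` is a hypothetical helper.

```python
import heapq
import math


def nearest(query, vectors, k):
    """Return the k (distance, id) pairs closest to `query` by cosine distance."""
    def distance(v):
        dot = sum(q * x for q, x in zip(query, v))
        return 1.0 - dot / (math.hypot(*query) * math.hypot(*v))

    scored = ((distance(v), obj_id) for obj_id, v in vectors.items())
    return heapq.nsmallest(k, scored)


# Toy 2-D vectors with made-up IDs; real embeddings have hundreds of dimensions.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

for dist, obj_id in nearest(toy_vectors["b"], toy_vectors, k=2):
    print(obj_id, round(dist, 4))
```

Because the query here is the vector of a stored object, that object ranks first with a distance of approximately 0. The same holds for the notebook's query above, which reuses a vector saved from an earlier result.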