# Similarity search with Weaviate's `text2vec-transformers` module


## Dependencies

In [None]:
!pip install weaviate-client

## Connect to Weaviate

[`text2vec-transformers`](https://docs.weaviate.io/weaviate/model-providers/transformers/embeddings) is **only** available through Weaviate open-source via [Docker](https://docs.weaviate.io/deploy/installation-guides/docker-installation) or [Kubernetes](https://docs.weaviate.io/deploy/installation-guides/k8s-installation). This integration is **not** available for Weaviate Cloud (WCD) serverless instances, as it requires spinning up a container with the Hugging Face model.

There are three ways to select the model you want to use:

1. [Pre-built transformers model containers](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers#pre-built-images)

    Example `docker-compose.yml`
    ```
    ---
    services:
      weaviate:
        command:
          - --host
          - 0.0.0.0
          - --port
          - '8080'
          - --scheme
          - http
        image: cr.weaviate.io/semitechnologies/weaviate:1.30.0
        ports:
          - 8080:8080
          - 50051:50051
        restart: on-failure:0
        environment:
          TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
          QUERY_DEFAULTS_LIMIT: 25
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
          PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
          DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
          ENABLE_MODULES: 'text2vec-transformers'
          CLUSTER_HOSTNAME: 'node1'
      t2v-transformers:
        image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
        environment:
          ENABLE_CUDA: '0'
    ...
    ```
    Then start the Docker containers with:
    ```
    docker-compose up -d
    ```

2. [Any model from Hugging Face Model Hub](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers#option-2-use-any-publicly-available-hugging-face-model)

3. [Use any private or local PyTorch or Tensorflow transformer model](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers#option-3-custom-build-with-a-private-or-local-model)
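
Whichever option you choose, Weaviate may need some time after startup before it accepts requests, since the inference container has to load the model first. The following is a minimal sketch of a readiness poll. It assumes Weaviate listens on `localhost:8080` (as in the compose file above) and uses Weaviate's readiness endpoint `/v1/.well-known/ready`; the helper names `http_probe` and `wait_until_ready` are hypothetical and not part of the Weaviate client.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8080/v1/.well-known/ready") -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return False


# Usage (assumes the containers from the compose file are starting up):
# wait_until_ready(http_probe)
```

Passing the probe as a callable keeps the retry loop independent of the HTTP call, so the same loop can be reused with a different check.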

In [None]:
import weaviate
from weaviate.config import AdditionalConfig

# Connect to your local Weaviate instance deployed with Docker
client = weaviate.connect_to_local(
    additional_config=AdditionalConfig(timeout=(900, 1200))  # generous timeouts, in seconds
)

client.is_ready()

In [3]:
# Check the cluster metadata to verify if the module is enabled
client.get_meta()['modules']

{'text2vec-transformers': {'model': {'_name_or_path': './models/model',
   'add_cross_attention': False,
   'architectures': ['BertModel'],
   'attention_probs_dropout_prob': 0.1,
   'bad_words_ids': None,
   'begin_suppress_tokens': None,
   'bos_token_id': None,
   'chunk_size_feed_forward': 0,
   'classifier_dropout': None,
   'cross_attention_hidden_size': None,
   'decoder_start_token_id': None,
   'diversity_penalty': 0,
   'do_sample': False,
   'early_stopping': False,
   'encoder_no_repeat_ngram_size': 0,
   'eos_token_id': None,
   'exponential_decay_length_penalty': None,
   'finetuning_task': None,
   'forced_bos_token_id': None,
   'forced_eos_token_id': None,
   'gradient_checkpointing': False,
   'hidden_act': 'gelu',
   'hidden_dropout_prob': 0.1,
   'hidden_size': 384,
   'id2label': {'0': 'LABEL_0', '1': 'LABEL_1'},
   'initializer_range': 0.02,
   'intermediate_size': 1536,
   'is_decoder': False,
   'is_encoder_decoder': False,
   'label2id': {'LABEL_0': 0, 'LABEL_1
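
Instead of reading through the full dump, the check can also be done programmatically. This is a minimal sketch, assuming `meta` is the dictionary returned by `client.get_meta()` with the shape shown above; the helper name `module_enabled` and the trimmed sample dictionary are hypothetical.

```python
def module_enabled(meta: dict, module_name: str) -> bool:
    """Return True if `module_name` is listed under the 'modules' key of the metadata."""
    modules = meta.get("modules") or {}  # tolerate a missing or empty entry
    return module_name in modules


# Hypothetical, heavily trimmed stand-in for the output of client.get_meta()
sample_meta = {
    "modules": {
        "text2vec-transformers": {"model": {"hidden_size": 384}},
    }
}

print(module_enabled(sample_meta, "text2vec-transformers"))  # True
print(module_enabled(sample_meta, "text2vec-openai"))        # False
```

With a live connection, the call would be `module_enabled(client.get_meta(), "text2vec-transformers")`.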

## Create a collection
> A collection stores your data objects and their vector embeddings.

In [4]:
# Note: avoid rerunning this cell in practice. It deletes all data in
# "JeopardyQuestion", which you would then need to re-import.
import weaviate.classes.config as wc

# Delete the collection if it already exists
if client.collections.exists("JeopardyQuestion"):
    client.collections.delete("JeopardyQuestion")

client.collections.create(
    name="JeopardyQuestion",

    vector_config=wc.Configure.Vectors.text2vec_transformers( # specify the vectorizer and model type you're using
        # pooling_strategy="masked_mean"
    ),

    properties=[ # defining properties (data schema) is optional
        wc.Property(name="Question", data_type=wc.DataType.TEXT), 
        wc.Property(name="Answer", data_type=wc.DataType.TEXT),
        wc.Property(name="Category", data_type=wc.DataType.TEXT, skip_vectorization=True), 
    ]
)

print("Successfully created collection: JeopardyQuestion.")

Successfully created collection: JeopardyQuestion.


## Import the Data

In [5]:
import json
import requests

url = 'https://raw.githubusercontent.com/weaviate/weaviate-examples/main/jeopardy_small_dataset/jeopardy_tiny.json'
resp = requests.get(url)
resp.raise_for_status()  # stop early if the download failed
data = json.loads(resp.text)

# Get a collection object for "JeopardyQuestion"
jeopardy = client.collections.get("JeopardyQuestion")

# Insert data objects
with jeopardy.batch.fixed_size(
    batch_size=2,
    # concurrent_requests=1
) as batch:
    for item in data:
        print("Adding", item)
        batch.add_object(item)

print("Import Complete")

if jeopardy.batch.failed_objects:
    print("There were some errors")
    for fo in jeopardy.batch.failed_objects:
        print(fo)

Adding {'Category': 'SCIENCE', 'Question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'Answer': 'Liver'}
Adding {'Category': 'ANIMALS', 'Question': "It's the only living mammal in the order Proboseidea", 'Answer': 'Elephant'}
Adding {'Category': 'ANIMALS', 'Question': 'The gavial looks very much like a crocodile except for this bodily feature', 'Answer': 'the nose or snout'}
Adding {'Category': 'ANIMALS', 'Question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'Answer': 'Antelope'}
Adding {'Category': 'ANIMALS', 'Question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'Answer': 'the diamondback rattler'}
Adding {'Category': 'SCIENCE', 'Question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'Answer': 'species'}
Adding {'Category': 'SCIENCE', 'Question': 'A metal that is "ductile" can be pulled into this while cold & u



Import Complete
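
The batch context manager sends objects in groups rather than one request per object. As a rough, conceptual illustration of what `batch_size=2` means, the sketch below splits a list into fixed-size groups. The helper `chunked` and the placeholder rows are hypothetical and not part of the Weaviate client, which additionally takes care of sending the batches and collecting `failed_objects`.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Ten placeholder rows, matching the number of objects in the tiny dataset
rows = [{"Answer": f"answer {i}"} for i in range(10)]
batches = list(chunked(rows, batch_size=2))

print(len(batches))               # 5
print([len(b) for b in batches])  # [2, 2, 2, 2, 2]
```

A small batch size like 2 is convenient for a demo; for larger datasets a bigger value usually reduces the number of round trips.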


In [6]:
jeopardy = client.collections.get("JeopardyQuestion")
print(jeopardy.aggregate.over_all())

AggregateReturn(properties={}, total_count=10)


## Query Weaviate: Similarity Search (Text objects)

Similarity search options for text objects in **Weaviate**:

1. [near_text](https://weaviate.io/developers/weaviate/search/similarity#an-input-medium)

2. [near_object](https://weaviate.io/developers/weaviate/search/similarity#an-object)

3. [near_vector](https://weaviate.io/developers/weaviate/search/similarity#a-vector)

### nearText Example

Find `JeopardyQuestion` objects related to "african beasts". Limit the results to 4 objects.

In [7]:
# note, you can reuse the collection object from the previous cell.
# Get a collection object for "JeopardyQuestion"
jeopardy = client.collections.get("JeopardyQuestion")

response = jeopardy.query.near_text(
    query="african beasts",
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2), "\n")

ID: 8a00fbd2-45fa-48f2-8854-0ed373f03613
Data: {
  "question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
  "answer": "Antelope",
  "category": "ANIMALS"
} 

ID: 7727cb1f-a08d-40b0-9da6-e4746f3337fa
Data: {
  "question": "The gavial looks very much like a crocodile except for this bodily feature",
  "answer": "the nose or snout",
  "category": "ANIMALS"
} 

ID: 3e1b444f-dbb0-4108-8c6e-eb2d6a7ceff5
Data: {
  "question": "It's the only living mammal in the order Proboseidea",
  "answer": "Elephant",
  "category": "ANIMALS"
} 

ID: aac57491-eced-4033-866c-a0ca20f20a35
Data: {
  "question": "Heaviest of all poisonous snakes is this North American rattlesnake",
  "answer": "the diamondback rattler",
  "category": "ANIMALS"
} 



Set `include_vector=True` to also return the vector embedding of each object.

In [8]:
response = jeopardy.query.near_text(
    query="african beasts",
    include_vector=True,
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2))
    print("Vector:", item.vector, "\n")

ID: 8a00fbd2-45fa-48f2-8854-0ed373f03613
Data: {
  "question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
  "answer": "Antelope",
  "category": "ANIMALS"
}
Vector: {'default': [-0.11791916936635971, 0.6624627113342285, 0.18456493318080902, -0.08122072368860245, -0.49247342348098755, -0.5737257599830627, -0.15386492013931274, 0.3993226885795593, -0.3364018201828003, 0.9105186462402344, -0.40116187930107117, -0.44404107332229614, -0.23901110887527466, 0.10679106414318085, -0.13704107701778412, -0.09618320316076279, 0.1664130538702011, -0.04804283007979393, 0.01201174221932888, 0.01210771780461073, 0.19907936453819275, 0.038498226553201675, 0.06976787745952606, 0.12092171609401703, -0.6761063933372498, -0.41834601759910583, -0.5966598391532898, 0.520249605178833, 0.2846285104751587, -0.630042552947998, -0.0729237049818039, 0.3715086281299591, 0.6463991403579712, 0.15955376625061035, -0.4426628351211548, -0.16930973529815674, -0.33322933316230774, -

In [9]:
# Save last item's vector embedding and UUID for near_object and near_vector search
vector = item.vector['default']
uuid = item.uuid

Now, also request the `distance` for each returned item.

In [10]:
import weaviate.classes.query as wq

response = jeopardy.query.near_text(
    query="african beasts",
    return_metadata=wq.MetadataQuery(distance=True),
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Distance:", item.metadata.distance)
    print("Data:", item.properties, "\n")

ID: 8a00fbd2-45fa-48f2-8854-0ed373f03613
Distance: 0.46423041820526123
Data: {'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'answer': 'Antelope', 'category': 'ANIMALS'} 

ID: 7727cb1f-a08d-40b0-9da6-e4746f3337fa
Distance: 0.5695910453796387
Data: {'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'answer': 'the nose or snout', 'category': 'ANIMALS'} 

ID: 3e1b444f-dbb0-4108-8c6e-eb2d6a7ceff5
Distance: 0.5756919980049133
Data: {'question': "It's the only living mammal in the order Proboseidea", 'answer': 'Elephant', 'category': 'ANIMALS'} 

ID: aac57491-eced-4033-866c-a0ca20f20a35
Distance: 0.7628661394119263
Data: {'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'answer': 'the diamondback rattler', 'category': 'ANIMALS'} 
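
The `distance` values are easier to interpret with the metric in mind. Weaviate's default distance metric is cosine distance, that is, one minus the cosine similarity, so 0 means the vectors point in the same direction and larger values mean less similar. Below is a minimal pure-Python sketch, assuming the collection uses that default (the collection created above does not override it); the function name `cosine_distance` is hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite direction)
```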



### nearObject Example

Search the `JeopardyQuestion` collection for the 4 objects closest to the object whose UUID we saved earlier.

In [14]:
print(f"Retrieve objects closest to item with UUID: {uuid}\n")

response = jeopardy.query.near_object(
    near_object=uuid,  # or pass a UUID string of your own, e.g. "a1dd67f9-bfa7-45e1-b45e-26eb8c52e9a6"
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")

Retrieve objects closest to item with UUID: aac57491-eced-4033-866c-a0ca20f20a35

ID: aac57491-eced-4033-866c-a0ca20f20a35
Data: {'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'answer': 'the diamondback rattler', 'category': 'ANIMALS'} 

ID: 8a00fbd2-45fa-48f2-8854-0ed373f03613
Data: {'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'answer': 'Antelope', 'category': 'ANIMALS'} 

ID: 3e1b444f-dbb0-4108-8c6e-eb2d6a7ceff5
Data: {'question': "It's the only living mammal in the order Proboseidea", 'answer': 'Elephant', 'category': 'ANIMALS'} 

ID: 7727cb1f-a08d-40b0-9da6-e4746f3337fa
Data: {'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'answer': 'the nose or snout', 'category': 'ANIMALS'} 
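
Note that the first result is the query object itself, since an object is always closest to its own vector. If only the other neighbors are of interest, the self-match can be filtered out on the client side. This is a minimal sketch; the helper `drop_self_match` is hypothetical, and `SimpleNamespace` instances stand in for the items of `response.objects`.

```python
from types import SimpleNamespace
from typing import Any, List


def drop_self_match(objects: List[Any], query_uuid: Any) -> List[Any]:
    """Return the results without the object that was used as the query."""
    # Compare as strings so both uuid.UUID values and plain strings work
    return [obj for obj in objects if str(obj.uuid) != str(query_uuid)]


# Stand-ins for response.objects, using two UUIDs from the output above
results = [
    SimpleNamespace(uuid="aac57491-eced-4033-866c-a0ca20f20a35"),
    SimpleNamespace(uuid="8a00fbd2-45fa-48f2-8854-0ed373f03613"),
]

others = drop_self_match(results, "aac57491-eced-4033-866c-a0ca20f20a35")
print([obj.uuid for obj in others])  # ['8a00fbd2-45fa-48f2-8854-0ed373f03613']
```

With a live response, the call would be `drop_self_match(response.objects, uuid)`; requesting one extra result via `limit` keeps the number of remaining neighbors unchanged.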



### nearVector Example

Search the `JeopardyQuestion` collection for the 4 objects closest to the query vector we saved earlier.

In [17]:
print(f"Retrieve objects closest to item with vector: {vector[:5]}...\n")

response = jeopardy.query.near_vector(
    near_vector=vector, # your vector object goes here
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")

Retrieve objects closest to item with vector: [0.09574556350708008, -0.13571687042713165, -0.2824910283088684, -0.04742668569087982, 0.03194054961204529]...

ID: aac57491-eced-4033-866c-a0ca20f20a35
Data: {'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'answer': 'the diamondback rattler', 'category': 'ANIMALS'} 

ID: 8a00fbd2-45fa-48f2-8854-0ed373f03613
Data: {'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'answer': 'Antelope', 'category': 'ANIMALS'} 

ID: 3e1b444f-dbb0-4108-8c6e-eb2d6a7ceff5
Data: {'question': "It's the only living mammal in the order Proboseidea", 'answer': 'Elephant', 'category': 'ANIMALS'} 

ID: 7727cb1f-a08d-40b0-9da6-e4746f3337fa
Data: {'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'answer': 'the nose or snout', 'category': 'ANIMALS'} 
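
Conceptually, a vector search ranks stored vectors by their distance to the query vector and returns the closest ones. The sketch below does this by brute force over a handful of toy 2-dimensional vectors, using cosine distance. It is only an illustration under these assumptions: the names `nearest` and `toy_vectors` are hypothetical, and Weaviate itself relies on an approximate nearest-neighbor index (HNSW by default) rather than comparing against every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    limit: int,
) -> List[Tuple[str, float]]:
    """Return the `limit` (key, distance) pairs closest to `query`, nearest first."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


# Toy stand-ins for stored embeddings (real ones here have 384 dimensions)
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

print([key for key, _ in nearest([1.0, 0.0], toy_vectors, limit=2)])  # ['a', 'b']
```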

