[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/model-providers/nomic-ai/similarity_search_byov_modernbert.ipynb)

![Cover image](vector_search_modernbert_cover_image.png)

# Generating Embeddings for Vector Search with ModernBERT in Weaviate
## A 100% open source recipe 🧑‍🍳 💚
By Mary Newhauser, MLE @ Weaviate

This is a code recipe that uses [Nomic AI](https://www.nomic.ai/)'s [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) model to generate text embeddings for machine learning papers, inserts them into [Weaviate](https://weaviate.io/) and performs similarity search over the documents.

In this notebook, we accomplish the following:
* Load and transform the [ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset
* Generate text embeddings for a random sample of 100 papers using `sentence-transformers` and `modernbert-embed-base`
* Insert the embeddings and their metadata into an embedded Weaviate instance
* Perform a basic similarity search over the dataset

## About ModernBERT
[ModernBERT](https://arxiv.org/abs/2412.13663) is a modernized update of the [BERT](https://arxiv.org/abs/1810.04805) encoder architecture. Compared with the original BERT, ModernBERT features:
* 16x longer sequence length (8,192 tokens vs. 512)
* Faster inference
* SOTA performance across tasks like classification and retrieval

For more information, check out Hugging Face's ModernBERT [blog post](https://huggingface.co/blog/modernbert).

## Requirements
To run this notebook, we used Python version `3.9.6` and Transformers `4.48.0`.

In [1]:
%%capture

!pip install -q transformers==4.48.0

In [2]:
%%capture
%pip install sentence-transformers
%pip install datasets
%pip install -U weaviate-client

## Load and transform dataset

In [None]:
from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers")

In [4]:
# Keep only "title" and "abstract" columns in train set
train_ds = ds["train"].select_columns(["title", "abstract"])


The original dataset contains over 100k titles and abstracts for ML papers from arXiv. For this demo, we'll take a random sample of 100 papers.

In [None]:
# Shuffle the dataset with a fixed seed (for reproducibility) and select the first 100 rows
subset_ds = train_ds.shuffle(seed=42).select(range(100))

# Concatenate each abstract and title into a single "text" field
def combine_text(row):
    row["text"] = row["abstract"] + " " + row["title"]
    return row

# Apply function to entire dataset
subset_ds = subset_ds.map(combine_text)

# Print number of rows
print(f"Number of rows: {len(subset_ds)}")

## Generate embeddings with `modernbert-embed-base`
We'll use the `sentence-transformers` library to load the `modernbert-embed-base` embedding model and embed the concatenated abstracts and titles, storing the embeddings in a new column of the sampled dataset.

In [None]:
from sentence_transformers import SentenceTransformer

# Load the SentenceTransformer model
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Function to generate embeddings for a single text
def generate_embeddings(example):
    example["embeddings"] = model.encode(example["text"], reference_compile=False)
    return example

# Apply the function to the dataset using map
embeddings_ds = subset_ds.map(generate_embeddings)
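
The `modernbert-embed-base` model card describes the model as trained with task prefixes (`search_document: ` for documents and `search_query: ` for queries), which the cell above does not add. The sketch below shows one way the embedding step could prepend them and encode texts in batches rather than row by row. The helper names `embed_documents` and `embed_query` are hypothetical, the prefix strings should be checked against the model card before relying on them, and `model` is assumed to be the `SentenceTransformer` instance loaded above.

```python
# Hypothetical helpers; the prefix strings follow the modernbert-embed-base
# model card as we understand it and should be verified there.
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def embed_documents(model, texts, batch_size=32):
    """Encode a list of document texts in batches, with the document prefix."""
    prefixed = [DOCUMENT_PREFIX + text for text in texts]
    # `batch_size` is a standard argument of SentenceTransformer.encode
    return model.encode(prefixed, batch_size=batch_size)


def embed_query(model, query):
    """Encode a single search query, with the query prefix."""
    return model.encode(QUERY_PREFIX + query)
```

With the objects from this notebook, this could be called as `embed_documents(model, list(subset_ds["text"]))`. If documents are embedded with a prefix, the search query should be embedded with the matching query prefix as well.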

Next, we'll convert the dataset to a `pandas` `DataFrame` for insertion into Weaviate.

In [None]:
import pandas as pd

# Convert HF dataset to Pandas DF
df = embeddings_ds.to_pandas()

# Take a peek at the data
df.head()
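
Before inserting, it can be worth confirming that every row carries a vector of the same length, since the collection below is configured without a vectorizer and stores the vectors exactly as supplied. This is a minimal sketch, assuming each embedding is a sequence of floats. The helper name `check_vector_dims` is hypothetical, and the default of 768 reflects our understanding of the output size of `modernbert-embed-base`.

```python
def check_vector_dims(vectors, expected_dim=768):
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Toy example with 3-dimensional vectors; the second one is too short
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8]]
bad_rows = check_vector_dims(toy_vectors, expected_dim=3)
print(bad_rows)  # [1]
```

On the DataFrame above, `check_vector_dims(df["embeddings"])` would be expected to return an empty list.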

## Insert the embeddings into Weaviate
### Create and configure an embedded Weaviate collection

[Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded) allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container.

If you're interested in other deployment methods, like Docker Compose or Kubernetes, check out this [page](https://weaviate.io/developers/weaviate/installation) in the Weaviate docs.

In [None]:
import weaviate

# Connect to Weaviate
client = weaviate.connect_to_embedded()

Next, we define the collection and its properties.

In [None]:
import weaviate.classes as wvc
import weaviate.classes.config as wc

# Define the collection name
collection_name = "ml_papers"

# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection
collection = client.collections.create(
    collection_name,
    # No vectorizer: we supply our own ModernBERT embeddings
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),

    # Define properties of metadata
    properties=[
        wc.Property(
            name="text",
            data_type=wc.DataType.TEXT
        ),
        wc.Property(
            name="title",
            data_type=wc.DataType.TEXT,
            skip_vectorization=True
        ),
    ]
)

Finally, we insert the embeddings and metadata into our Weaviate collection.

In [11]:
# Insert embeddings and metadata into collection
objs = []
for text, title, embedding in zip(df["text"], df["title"], df["embeddings"]):
    objs.append(
        wvc.data.DataObject(
            properties={"text": text, "title": title},
            vector=embedding.tolist(),
        )
    )

collection.data.insert_many(objs);
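
`insert_many` reports per-object failures in its return value, so the response is worth inspecting rather than discarding. This is a minimal sketch, assuming the v4 Python client's response exposes `has_errors` and an `errors` mapping from object index to an error with a `message` attribute. That matches the client versions we are aware of, but is worth confirming for yours. `summarize_insert` is a hypothetical helper, and the response here is a stand-in built for illustration.

```python
from types import SimpleNamespace


def summarize_insert(response):
    """Build a short report from an insert_many-style response object."""
    if not response.has_errors:
        return "All objects inserted."
    lines = [f"{len(response.errors)} object(s) failed:"]
    for index, error in sorted(response.errors.items()):
        # `message` is assumed to hold the human-readable error text
        lines.append(f"  row {index}: {error.message}")
    return "\n".join(lines)


# Stand-in response for illustration only; a real one would come from
# `response = collection.data.insert_many(objs)`
fake_response = SimpleNamespace(
    has_errors=True,
    errors={3: SimpleNamespace(message="vector length mismatch")},
)
print(summarize_insert(fake_response))
```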

## Query the data using similarity search

Here, we perform a simple similarity search that returns the embedded papers most similar to our search query.
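
Conceptually, a `near_vector` query ranks stored vectors by their distance to the query vector. Weaviate's default metric is cosine distance, and in practice the search is served by an approximate index rather than an exhaustive scan. The brute-force sketch below is only meant to illustrate the ranking itself on toy 2-dimensional vectors; `cosine_distance` and `top_n_by_distance` are hypothetical names, not part of the Weaviate client.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_n_by_distance(query_vec, docs, n):
    """Return the n (title, distance) pairs closest to query_vec."""
    scored = [(title, cosine_distance(query_vec, vec)) for title, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n]


# Toy vectors for illustration; real embeddings have hundreds of dimensions
docs = {
    "paper A": [1.0, 0.0],
    "paper B": [0.0, 1.0],
    "paper C": [1.0, 1.0],
}
for title, dist in top_n_by_distance([1.0, 0.1], docs, n=2):
    print(f"{title}: {dist:.3f}")
```

Smaller distances mean closer matches, so the first result is the most similar document.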

In [12]:
# Define query and number of results
query = "Which papers apply ML to the medical domain?"

top_n = 3

In [None]:
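from weaviate.classes.query import MetadataQuery

# Embed the query with the same model used for the documents
query_embedding = model.encode(query)

# Retrieve the top_n nearest objects and request their distances to the query vector
results = collection.query.near_vector(
    near_vector=query_embedding,
    limit=top_n,
    return_metadata=MetadataQuery(distance=True),
)

print(f"Top {top_n} results:\n")
for obj in results.objects:
    print(obj.properties["title"])
    print(f"Distance: {obj.metadata.distance:.4f}")
    print("\n")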

☁️ Want to scale this notebook? 

😍 Get 14 days of free access to Weaviate Cloud's Sandbox by creating an account [here](https://console.weaviate.cloud/). 

*No name, no credit card required.*