[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_weaviate.ipynb)

# RAG with Weaviate


This is a code sample that uses [Weaviate](https://weaviate.io/) to perform RAG over PDF documents parsed by [Docling](https://ds4sd.github.io/docling/). The table below lists the service used at each step and whether it runs locally or remotely.

| Step | Provider | Execution (local or remote) |
| --- | --- | --- |
| Embedding | OpenAI | 🌐 Remote |
| Vector store | Weaviate | 🌐 Remote |
| Gen AI | OpenAI | 🌐 Remote |

In this example, we accomplish the following tasks:
* Parse the top machine learning papers on [arXiv](https://arxiv.org/) using Docling
* Perform hierarchical chunking of the documents using Docling
* Generate text embeddings with OpenAI
* Perform RAG using [Weaviate](https://weaviate.io/developers/weaviate/search/generative)


To run this notebook, you'll need:
* An [OpenAI API key](https://platform.openai.com/docs/quickstart)
* Access to a GPU (CUDA or Apple silicon MPS)

Note: For best results, please use **GPU acceleration** to run this notebook. Here are two options for running this notebook:
1. **Locally on a MacBook with an Apple silicon chip.** Docling supports MPS acceleration on Apple silicon.
2. **On Google Colab.** Converting all documents in the notebook takes approximately 8 minutes on a Google Colab T4 GPU.

### Install Docling and Weaviate client

Note: If Colab prompts you to restart the session after running the cell below, click "restart" and proceed with running the rest of the notebook.

In [1]:
%%capture
%pip install docling~="2.25.2"
%pip install -U weaviate-client~="4.11.1"
%pip install rich
%pip install torch

import warnings

warnings.filterwarnings("ignore")

import logging

# we will change the log level for Weaviate client
logging.getLogger("weaviate").setLevel(logging.ERROR)

## Docling

Docling can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. In our case, we ran it on Google Colab with a CUDA-enabled Tesla T4 GPU. If you're using a MacBook with an Apple silicon chip, Docling integrates with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, offers energy-efficient performance on Apple silicon, and is compatible with all Metal-supported GPUs.

The code below checks whether a GPU is available, through either CUDA or MPS.

In [2]:
import torch

# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise EnvironmentError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )

CUDA GPU is enabled: Tesla T4
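
The check above stops the notebook when no accelerator is found. Docling can also run on CPU, only more slowly, so a fallback is sometimes preferable to an error. The sketch below is one way to express that choice; `select_device` is a hypothetical helper for illustration, not part of Docling or PyTorch.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name, falling back to CPU.

    Hypothetical helper: the flags are expected to come from
    torch.cuda.is_available() and torch.backends.mps.is_available().
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # slower, but conversion still works


# Example with hard-coded flags (no accelerator present)
print(select_device(cuda_available=False, mps_available=False))
# cpu
```

In this notebook it could be called as `torch.device(select_device(torch.cuda.is_available(), torch.backends.mps.is_available()))`.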


Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.

Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.

In [2]:
# Influential machine learning papers
arxiv_urls = [
    "http://arxiv.org/abs/2303.08774v3",
    "http://arxiv.org/abs/2307.09288v2",
    "http://arxiv.org/abs/2302.13971v1",
    "http://arxiv.org/abs/2303.12712v5",
    "http://arxiv.org/abs/2306.05685v3",
    "http://arxiv.org/abs/2301.12597v3",
    "http://arxiv.org/abs/2304.02643v1",
    "http://arxiv.org/abs/2305.10403v3",
    "http://arxiv.org/abs/2306.01116v1",
    "http://arxiv.org/abs/2303.03378v1",
]

print(arxiv_urls)

source_titles = [
    "GPT-4 Technical Report",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
    "LLaMA: Open and Efficient Foundation Language Models",
    "Sparks of Artificial General Intelligence: Early experiments with GPT-4",
    "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena",
    "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models",
    "Segment Anything",
    "PaLM 2 Technical Report",
    "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only",
    "PaLM-E: An Embodied Multimodal Language Model",
]


['http://arxiv.org/abs/2303.08774v3', 'http://arxiv.org/abs/2307.09288v2', 'http://arxiv.org/abs/2302.13971v1', 'http://arxiv.org/abs/2303.12712v5', 'http://arxiv.org/abs/2306.05685v3', 'http://arxiv.org/abs/2301.12597v3', 'http://arxiv.org/abs/2304.02643v1', 'http://arxiv.org/abs/2305.10403v3', 'http://arxiv.org/abs/2306.01116v1', 'http://arxiv.org/abs/2303.03378v1']
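
The URLs above point to arXiv abstract pages (`/abs/`). arXiv serves the PDF of the same paper version under `/pdf/` with the same identifier, so if you want Docling to parse the PDF itself, the list can be rewritten first. A minimal sketch, where `to_pdf_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def to_pdf_url(abs_url: str) -> str:
    """Map an arXiv abstract URL (/abs/<id>) to its PDF URL (/pdf/<id>)."""
    parts = urlsplit(abs_url)
    if not parts.path.startswith("/abs/"):
        raise ValueError(f"Not an arXiv abstract URL: {abs_url}")
    paper_id = parts.path[len("/abs/"):]
    # Use https and keep the versioned identifier (e.g. 2303.08774v3)
    return urlunsplit(("https", parts.netloc, f"/pdf/{paper_id}", "", ""))


print(to_pdf_url("http://arxiv.org/abs/2303.08774v3"))
# https://arxiv.org/pdf/2303.08774v3
```

Applied to the whole list, this would be `[to_pdf_url(u) for u in arxiv_urls]`.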



### Convert PDFs to Docling documents

Here we use Docling's `.convert_all()` to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.

Note: Please ignore the `ERR#` message.

In [3]:
from docling.datamodel.document import ConversionResult
from docling.document_converter import DocumentConverter

# Instantiate the doc converter
doc_converter = DocumentConverter()

# Directly pass list of files or streams to `convert_all`
conv_results_iter = doc_converter.convert_all(arxiv_urls)  # previously `convert`

# Iterate over the generator to get a list of Docling documents
docs = [result.document for result in conv_results_iter]

### Post-process extracted document data
#### Perform hierarchical chunking on documents

We use Docling's `HierarchicalChunker()` to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.

In [4]:
from docling_core.transforms.chunker import HierarchicalChunker

# Initialize lists for text, and titles
texts, titles = [], []

chunker = HierarchicalChunker()

# Process each document in the list
for doc, title in zip(docs, source_titles):  # Pair each document with its title
    chunks = list(
        chunker.chunk(doc)
    )  # Perform hierarchical chunking and get text from chunks
    for chunk in chunks:
        texts.append(chunk.text)
        titles.append(title)
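
To make the idea of hierarchy-aware chunking concrete, here is a toy sketch that splits Markdown-like lines into chunks and records the heading path each chunk sits under. It is an illustration only and not how `HierarchicalChunker` is implemented; the function name and the input format are assumptions.

```python
def toy_hierarchical_chunks(lines):
    """Return (heading_path, text) pairs from Markdown-like lines."""
    path = []  # current stack of headings, outermost first
    chunks = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            # Drop headings at the same or deeper level, then push this one
            path = path[: level - 1] + [stripped.lstrip("#").strip()]
        else:
            chunks.append((tuple(path), stripped))
    return chunks


sample = [
    "# Intro",
    "First paragraph.",
    "## Background",
    "Prior work.",
    "# Methods",
    "Our approach.",
]
for headings, text in toy_hierarchical_chunks(sample):
    print(" > ".join(headings), "|", text)
# Intro | First paragraph.
# Intro > Background | Prior work.
# Methods | Our approach.
```

Keeping the heading path next to each chunk is what lets a retriever return a passage together with the section it came from.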

Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.

In [5]:
# Concatenate title and text
for i in range(len(texts)):
    texts[i] = f"{titles[i]} {texts[i]}"

## Weaviate
### Connect to Weaviate Cloud and configure a collection

We'll be using the OpenAI API both for generating the text embeddings and for the generative model in our RAG pipeline. The code below fetches your API key from Colab Secrets if you're running this notebook in Google Colab, or from an environment variable if you're running it as a regular Jupyter notebook. All you need to do is replace `openai_api_key_var` with the name of your environment variable or Colab secret for the API key.

If you're running this notebook in Google Colab, make sure you [add](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) your API key as a secret.

In [6]:
# OpenAI API key variable name
openai_api_key_var = "OPENAI_API_KEY"  # Replace with the name of your secret/env var

# Fetch OpenAI API key
try:
    # If running in Colab, fetch API key from Secrets
    import google.colab
    from google.colab import userdata

    openai_api_key = userdata.get(openai_api_key_var)
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API key from environment variable
    import os

    openai_api_key = os.getenv(openai_api_key_var)
    if not openai_api_key:
        raise EnvironmentError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )
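
The same Colab-or-environment lookup is needed for every credential in this notebook, so it can be wrapped in a small function. This is a sketch under the same assumptions as the cell above; `get_secret` is a hypothetical helper name, and the placeholder value is for demonstration only.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        value = os.getenv(name)
    else:
        value = userdata.get(name)
    if not value:
        raise EnvironmentError(f"Secret or environment variable '{name}' is not set.")
    return value


# Example outside Colab: the value comes from the environment
os.environ["EXAMPLE_SECRET"] = "dummy-value"  # placeholder, not a real key
print(get_secret("EXAMPLE_SECRET"))
```

The helper could then be used for `OPENAI_API_KEY`, `WEAVIATE_URL`, and `WEAVIATE_API_KEY` alike.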

In [8]:
from google.colab import userdata

# Weaviate Cloud credentials, stored as Colab secrets
WEAVIATE_URL = userdata.get("WEAVIATE_URL")
WEAVIATE_API_KEY = userdata.get("WEAVIATE_API_KEY")

In [9]:
import weaviate
from weaviate.classes.init import AdditionalConfig, Auth, Timeout

# Best practice: keep credentials in secrets or environment variables
weaviate_url = WEAVIATE_URL
weaviate_api_key = WEAVIATE_API_KEY

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=Auth.api_key(weaviate_api_key),
    headers={"X-OpenAI-Api-Key": openai_api_key},  # used by the OpenAI modules
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=120)  # Values in seconds
    ),
)

print(client.is_ready())

True


As an alternative to Weaviate Cloud, [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded) allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this [page](https://weaviate.io/developers/weaviate/installation) in the Weaviate docs.

In [10]:
import weaviate

# Alternative: connect to Embedded Weaviate instead of Weaviate Cloud
# client = weaviate.connect_to_embedded(headers={"X-OpenAI-Api-Key": openai_api_key})

In [11]:
import weaviate.classes.config as wc
from weaviate.classes.config import DataType, Property

# Define the collection name
collection_name = "docling_latest_papers"

# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection
collection = client.collections.create(
    name=collection_name,
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large",  # Specify your embedding model here
    ),
    # Enable the generative model from OpenAI
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4o"  # Specify your generative model for RAG here
    ),
    # Define properties of metadata
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="title", data_type=wc.DataType.TEXT, skip_vectorization=True),
    ],
)

### Wrangle data into an acceptable format for Weaviate

Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.

In [12]:
# Initialize the data object
data = []

# Create a dictionary for each row by iterating through the corresponding lists
for text, title in zip(texts, titles):
    data_point = {
        "text": text,
        "title": title,
    }
    data.append(data_point)

In [13]:
# Import the chunks in batches; Weaviate generates the embeddings on insertion
with collection.batch.dynamic() as batch:
    for data_row in data:
        batch.add_object(
            properties=data_row,
        )
        if batch.number_errors > 10:
            print("Batch import stopped due to excessive errors.")
            break

failed_objects = collection.batch.failed_objects
if failed_objects:
    print(f"Number of failed imports: {len(failed_objects)}")
    print(f"First failed object: {failed_objects[0]}")

### Insert data into Weaviate and generate embeddings

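Embeddings are generated when objects are inserted into the Weaviate collection, so the batch import above has already produced them. The `insert_many` call below is an alternative way to insert the data and is left commented out.

In [14]:
# Insert text chunks and metadata into vector DB collection
# response = collection.data.insert_many(data)

# if response.has_errors:
#     print(response.errors)
# else:
#     print("Insert complete.")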

### Query the data

Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.

In [15]:
from weaviate.classes.query import MetadataQuery

response = collection.query.near_text(
    query="GPT-4",
    limit=2,
    return_metadata=MetadataQuery(distance=True),
    return_properties=["text", "title"],
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)

{'text': 'GPT-4 Technical Report ()', 'title': 'GPT-4 Technical Report'}
0.39860624074935913
{'text': 'GPT-4 Technical Report (cs)', 'title': 'GPT-4 Technical Report'}
0.4060209393501282
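
The number printed after each result is a distance: smaller means closer to the query. Assuming the collection uses Weaviate's default cosine metric (the collection above does not set one explicitly), the distance is one minus the cosine similarity of the two embedding vectors. A small worked example with made-up 3-dimensional vectors:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors, not real embeddings
query_vec = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_distance(query_vec, same_direction))  # 0.0
print(cosine_distance(query_vec, orthogonal))      # 1.0
```

On this scale, the distances of roughly 0.40 shown above indicate chunks that are related to the query but far from identical to it.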


### Perform RAG on parsed articles

Weaviate's `generate` module allows you to perform RAG over your embedded data without having to use a separate framework.

We specify a prompt that includes the field we want to search through in the database (in this case it's `text`), a query that includes our search term, and the number of retrieved results to use in the generation.

In [17]:
from rich.console import Console
from rich.panel import Panel

# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} is doing on various benchmarks, using only the retrieved context."
query = "GPT-4"

response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)

# Prettify the output using Rich
console = Console()

console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)
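
Conceptually, a grouped task retrieves the top `limit` chunks once and sends them to the generative model together with the task, producing a single answer for the whole result set. The sketch below mimics that composition in plain Python so the flow is visible. The prompt layout Weaviate uses internally is not shown in this notebook, so the layout here is an assumption, and `build_grouped_prompt` is a hypothetical helper.

```python
def build_grouped_prompt(task: str, retrieved: list[dict], field: str = "text") -> str:
    """Combine a task with the chosen field of each retrieved object."""
    context = "\n".join(
        f"[{i}] {obj[field]}" for i, obj in enumerate(retrieved, start=1)
    )
    return f"{task}\n\nContext:\n{context}"


# Made-up retrieval results with the same properties as our collection
retrieved = [
    {"title": "Paper A", "text": "Paper A chunk one"},
    {"title": "Paper A", "text": "Paper A chunk two"},
]
print(build_grouped_prompt("Summarize the findings.", retrieved))
# Summarize the findings.
#
# Context:
# [1] Paper A chunk one
# [2] Paper A chunk two
```

Raising `limit` adds more context to the single request, which can improve coverage at the cost of a longer prompt.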

In [22]:
# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} has been trained."
query = "LLAMA"

response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)

# Prettify the output using Rich
console = Console()

console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)

We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method to convert a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the [documentation](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate).