[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/operations/cleanlab/rag_with_weaviate_and_cleanlab.ipynb)

# Deploy Trustworthy RAG with Weaviate and Cleanlab

Large Language Models (LLMs) occasionally hallucinate incorrect answers, especially for questions not well-supported within their training data. Retrieval Augmented Generation (RAG) mitigates this by supplying LLMs with context retrieved from knowledge databases like Weaviate. While organizations are rapidly adopting RAG to pair the power of LLMs with their own proprietary data, hallucinated/incorrect RAG responses remain a problem.

This tutorial demonstrates an easy solution: build *trustworthy* RAG applications using Weaviate and Cleanlab to mitigate hallucinated/incorrect responses.

Cleanlab boosts the reliability of any LLM application by scoring when LLM responses are untrustworthy. Trust scoring happens in real time and does not require any data labeling or model training work. Cleanlab provides additional real-time Evals for other RAG components like the retrieved context, which help you diagnose *why* certain RAG responses are inaccurate/bad. Cleanlab's detection enables you to automatically flag/prevent inaccurate/bad responses from your RAG app, and avoid losing your users' trust.

Weaviate streamlines RAG application development, providing a scalable vector database to store your organization's knowledge, as well as vectorizers and integrations with generative model providers (like OpenAI & Anthropic). By providing all of the necessary components via a simple developer experience, Weaviate enables you to quickly deploy a performant RAG application. Adding Cleanlab on top provides an additional layer of trust.

## Setup

For this tutorial, we will need:
- Weaviate Cloud Cluster: A managed vector database (DB) instance provided by Weaviate. Follow the [quickstart](https://weaviate.io/developers/wcs/quickstart) to create one
- Cleanlab API Key: Sign up at [tlm.cleanlab.ai/](https://tlm.cleanlab.ai/) to get a free key
- OpenAI API Key: To make completion requests to an LLM

Start by installing the required dependencies. We use the following versions for developing this notebook:

```
Weaviate Database Version == 1.30.1
weaviate-client==4.11.1
cleanlab_tlm==1.1.0
```

In [None]:
%pip install weaviate-client cleanlab_tlm

In [1]:
import os, re
from typing import List
import pandas as pd

import weaviate
import weaviate.classes as wvc
import weaviate.classes.config as wc

from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals
from openai import OpenAI

Set the keys you obtained above as environment variables.

In [None]:
os.environ['WCD_DEMO_URL'] = '<your-weaviate-instance-url>'
os.environ['WCD_DEMO_RO_KEY'] = '<your-weaviate-instance-key>'
os.environ['OPENAI_API_KEY'] = '<your-openai-key>'
os.environ['CLEANLAB_API_KEY'] = '<your-cleanlab-key>'
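Before connecting, it can help to confirm that none of the keys are still placeholders. The following is a minimal sketch and not part of the original notebook: the helper name `find_unset_keys` is hypothetical, and treating any value of the form `<...>` as a placeholder is an assumption based on the cell above.

```python
import os

# Environment variable names used in this tutorial
REQUIRED_KEYS = ["WCD_DEMO_URL", "WCD_DEMO_RO_KEY", "OPENAI_API_KEY", "CLEANLAB_API_KEY"]

def find_unset_keys(env, required=REQUIRED_KEYS):
    """Return required keys that are missing, empty, or still '<placeholder>' values."""
    unset = []
    for key in required:
        value = env.get(key, "").strip()
        # Assumption: placeholders look like '<your-...>' as in the cell above
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(key)
    return unset

unset_keys = find_unset_keys(os.environ)
if unset_keys:
    print(f"Set these before continuing: {', '.join(unset_keys)}")
```

Failing early here tends to be easier to debug than an authentication error raised later by one of the clients.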

In [3]:
# Setup connection with Weaviate's cloud instance

wcd_url = os.environ["WCD_DEMO_URL"]    # Create a cluster in WCD and get the URL
wcd_api_key = os.environ["WCD_DEMO_RO_KEY"] # Get it for free from the cluster's setting
openai_api_key = os.environ["OPENAI_API_KEY"]   # Get it from OpenAI's app portal

headers = {
    "X-OpenAI-Api-Key": openai_api_key,
}

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,                                    
    auth_credentials=wvc.init.Auth.api_key(wcd_api_key),    
    headers=headers
)

print(client.is_ready())

True


Now, we initialize Cleanlab's object with default settings. You can achieve better results or lower latency by adjusting the [configurations](https://help.cleanlab.ai/tlm/faq/#recommended-tlm-configurations-to-try).

In [None]:
openai = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

cleanlab_api_key = os.environ["CLEANLAB_API_KEY"]   # Sign up at Cleanlab's developer portal to get a free key
trustworthy_rag = TrustworthyRAG(api_key=cleanlab_api_key)  # Optional configurations can improve accuracy/latency

## Create collection in vector database

A collection in Weaviate is analogous to an index in a database. Two components matter for this RAG application:
1. **Vectorizer / Embedding Model** - Weaviate offers an in-house managed embedding service and third-party integrations (like `text-embedding` from OpenAI). We use the [Weaviate Embedding](https://weaviate.io/blog/introducing-weaviate-embeddings) service here, configured on the collection below.
2. **Generator Model** - Weaviate also integrates with generative model providers such as OpenAI. In this tutorial, the collection below configures only the vectorizer, and we call OpenAI's API directly to generate an answer based on the user query and retrieved context.

In [5]:
collection_name = 'customer_support'
embedding_model = 'Snowflake/snowflake-arctic-embed-m-v1.5'
embedding_dimension = 1024

if client.collections.exists(collection_name):  # In case we've created this collection before
    client.collections.delete(collection_name)

customer_support = client.collections.create(
    collection_name,

    vectorizer_config=wc.Configure.Vectorizer.text2vec_weaviate(
        model=embedding_model,
    )
)

## Read data

RAG is all about connecting LLMs to data, to better inform their answers. This tutorial uses Nvidia’s Q1 FY2024 earnings report as an example data source for populating the RAG application's knowledge base.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'

In [None]:
with open("NVIDIA_Financial_Results_Q1_FY2024.md", "r", encoding="utf-8") as file:
    data = file.read()

print(data[:200])

# NVIDIA Announces Financial Results for First Quarter Fiscal 2024

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago 


## Chunk documents

To control the size of each context embedded in our vector index and provided to the LLM, we split documents into smaller chunks.
When a user submits a query to the RAG system, the relevant chunks are retrieved from the vector database and used as context for the LLM to generate a response.
Embedding just a few words provides too little information in each context, while embedding entire documents makes it harder to retrieve specific snippets of useful information for accurate responses.
It's important to find the right chunk size for your use-case, depending on the types of documents and questions that are handled.
Here we use a simple chunking strategy, splitting the text based on a fixed `chunk_size`. Additionally, chunks overlap based on a specified `overlap_size`.

In [7]:
def chunk(text: str, chunk_size: int, overlap_size: int) -> List[str]:
    source_text = re.sub(r"\s+", " ", text)  # Remove multiple whitespaces
    text_words = re.split(r"\s", source_text)  # Split text by single whitespace

    chunks = []
    for i in range(0, len(text_words), chunk_size):  # Iterate through & chunk data
        chunk = " ".join(text_words[max(i - overlap_size, 0): i + chunk_size])  # Join a set of words into a string
        chunks.append(chunk)
    return chunks
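To make the overlap behavior concrete, here is a small standalone sketch that repeats the same logic on a toy string of ten words. The toy text and the sizes (`chunk_size=4`, `overlap_size=2`) are illustrative choices, not values used elsewhere in this tutorial.

```python
import re
from typing import List

def chunk(text: str, chunk_size: int, overlap_size: int) -> List[str]:
    # Same logic as the function above, repeated so this snippet runs on its own
    source_text = re.sub(r"\s+", " ", text)
    text_words = re.split(r"\s", source_text)
    chunks = []
    for i in range(0, len(text_words), chunk_size):
        # Each chunk starts `overlap_size` words before its nominal start index
        chunks.append(" ".join(text_words[max(i - overlap_size, 0): i + chunk_size]))
    return chunks

toy_text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
toy_chunks = chunk(toy_text, chunk_size=4, overlap_size=2)
for c in toy_chunks:
    print(c)
# → w0 w1 w2 w3
# → w2 w3 w4 w5 w6 w7
# → w6 w7 w8 w9
```

Note that every chunk after the first can contain up to `chunk_size + overlap_size` words, because the overlap is prepended to the chunk rather than carved out of it.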

In [None]:
chunk_size = 150
overlap_size = 25

chunked_text = chunk(data, chunk_size, overlap_size)
chunked_text

The document is divided into 8 chunks. Now, let's upload them to our vector index.

## Upsert chunks & vectors to Weaviate

In [None]:
chunks_list = []

for i, chunk_str in enumerate(chunked_text):  # Avoid reusing the name `chunk`, which would overwrite the chunk() function
    data_properties = {
        "chunk": chunk_str,
        "chunk_index": i
    }
    
    chunk_data_object = wvc.data.DataObject(properties=data_properties)
    chunks_list.append(chunk_data_object)

customer_support.data.insert_many(chunks_list)

Run a quick test to verify that all 8 chunks have been upserted and vectorized.

In [None]:
response = customer_support.aggregate.over_all(total_count=True)
print(response.total_count)

for item in customer_support.iterator(include_vector=True):
    print(item.properties)
    break

## Ask questions to our RAG application

Now that the Weaviate database is loaded with text chunks and their corresponding embeddings, we can start querying it to answer questions.

### Easy

We first pose a straightforward question that can be directly answered by the provided data and easily located within a few lines of text.

In [11]:
query = "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"

In [23]:
# Fetch the chunks (limit to 2) which are similar to user's query in the vector space
response = customer_support.query.near_text(
    query=query,
    limit=2
)

chunks = [obj.properties['chunk'] for obj in response.objects]
print('\n'.join(chunks))

# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the previous quarter - **Record Data Center revenue** of $4.28 billion - **Second quarter fiscal 2024 revenue outlook** of $11.00 billion GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter. Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, emphasizing NVIDIA's role and advancements in these areas. During the first quarter of fiscal
Earnings Per Share | $1.09 | $0.88 | $1.36 | Up 24% | Down 20% | ## Outl

Let's format these chunks into a prompt for the LLM, and call the OpenAI API to get the response.

In [33]:
def create_context(chunks):
    """Create context by concatenating chunks."""
    context = "Context:\n"
    for chunk in chunks:
        context += f"{chunk}\n"
    return context

system_prompt = "You are a helpful AI assistant that answers query strictly based on the provided context.\n"
context = create_context(chunks)
user_prompt = f"{context}\nQuery:\n{query}"

openai_model = 'gpt-4o-mini'
response = openai.chat.completions.create(
    model=openai_model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
answer = response.choices[0].message.content
print(answer)

NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.


We can easily verify that the above response is correct, as the answer is evident in the retrieved chunks.

Let's automatically evaluate the response. TrustworthyRAG runs Cleanlab's state-of-the-art LLM uncertainty estimator, the [Trustworthy Language Model](https://cleanlab.ai/tlm/), to provide a **trustworthiness score** indicating overall confidence that your RAG's response is *correct*. 

TrustworthyRAG can simultaneously run additional evaluations to diagnose *why* responses are likely incorrect, or to detect other types of issues. Let's see what Evals are run by default:

In [47]:
default_evals = get_default_evals()
for eval in default_evals:
    print(f"{eval.name}")

context_sufficiency
response_groundedness
response_helpfulness
query_ease


Each Eval returns a score between 0 and 1 (higher is better) that assesses a different aspect of your RAG system:

1. **context_sufficiency**: Evaluates whether the retrieved context contains sufficient information to completely answer the query. A low score indicates that key information is missing from the context (perhaps due to poor retrieval or missing documents).

2. **response_groundedness**: Evaluates whether claims/information stated in the response are explicitly supported by the provided context.

3. **response_helpfulness**: Evaluates whether the response effectively addresses the user query and appears helpful.

4. **query_ease**: Evaluates whether the user query seems easy for an AI system to properly handle. Complex, vague, tricky, or disgruntled-sounding queries receive lower scores.
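The descriptions above suggest a simple way to turn low scores into a rough diagnosis. The sketch below is a heuristic illustration only: the `diagnose` helper, the `0.5` threshold, and the example scores are all made up for this snippet and are not part of Cleanlab's API or output.

```python
def diagnose(scores, threshold=0.5):
    """Map low Eval scores to rough, heuristic explanations (threshold is an assumption)."""
    low = {name for name, score in scores.items() if score < threshold}
    notes = []
    if "context_sufficiency" in low:
        notes.append("context may be missing key information (check retrieval or documents)")
    if "response_groundedness" in low:
        notes.append("response makes claims that the context does not support")
    if "response_helpfulness" in low:
        notes.append("response may not address the user query")
    if "query_ease" in low:
        notes.append("query may be complex, vague, or tricky")
    return notes

# Made-up scores for illustration, keyed by the default Eval names
example_scores = {
    "context_sufficiency": 0.2,
    "response_groundedness": 0.9,
    "response_helpfulness": 0.3,
    "query_ease": 0.8,
}
for note in diagnose(example_scores):
    print("-", note)
```

In practice, a suitable threshold depends on your application and is best chosen by inspecting scores on your own queries.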

To run TrustworthyRAG, we need the full prompt sent to the LLM (the system message, the retrieved chunks, and the user's query) along with the LLM's response.

In [34]:
eval_result = trustworthy_rag.score(
    query=query,
    context=context,
    response=answer,
    prompt=system_prompt + user_prompt
)

print("Evaluation results:")
for metric, value in eval_result.items():
    print(f"{metric}: {value['score']}")

Evaluation results:
trustworthiness: 1.0
context_sufficiency: 0.9975123397108121
response_groundedness: 0.9975124375711611
response_helpfulness: 0.9975124251881636
query_ease: 0.9974677786997358


**Analysis:** The `trustworthiness` score of 1.0 indicates that this response is reliable and correct, i.e. not hallucinated. The high `context_sufficiency` and `query_ease` scores also show that the retrieved context was sufficient to answer this easy query.

Let's define a function that runs the above workflow.

In [83]:
def get_answer(query, evaluator = trustworthy_rag):

    # Fetch the relevant chunks from the vector database
    print(f"Query:\n{query}\n")
    response = customer_support.query.near_text(
        query=query,
        limit=2,
    )

    # Create context from the chunks
    chunks = [obj.properties['chunk'] for obj in response.objects]
    context = create_context(chunks)
    print(context)

    # Generate the answer using LLM
    system_prompt = "You are a helpful AI assistant that answers query based on the context.\n"
    user_prompt = f"{context}\nQuery:\n{query}"

    openai_model = 'gpt-4o-mini'
    response = openai.chat.completions.create(
        model=openai_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    answer = response.choices[0].message.content
    print(f"Generated response:\n{answer}\n")

    # Evaluate the response using TrustworthyRAG
    eval_result = evaluator.score(
        query=query,
        context=context,
        response=answer,
        prompt=system_prompt + user_prompt
    )

    print("Evaluation results:")
    for metric, value in eval_result.items():
        print(f"{metric}: {value['score']}")

Now let’s run an **out-of-data** query that **cannot** be answered using the provided document.

In [75]:
get_answer("How does the report explain why NVIDIA's Gaming revenue decreased year over year?")

Query:
How does the report explain why NVIDIA's Gaming revenue decreased year over year?

Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the previous quarter - **Record Data Center revenue** of $4.28 billion - **Second quarter fiscal 2024 revenue outlook** of $11.00 billion GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter. Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, emphasizing NVIDIA's role and advancements in these areas. During t

**Analysis:** The generator LLM avoids conjecture by providing a reliable response, as seen in the high `trustworthiness` score. The low `context_sufficiency` score reflects that the retrieved context lacks the needed information, and the low `response_helpfulness` score indicates that the response doesn’t actually answer the user’s query.

Let’s see how our RAG system responds to **challenging** questions, which may be misleading.

In [79]:
get_answer("How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?")

Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the previous quarter - **Record Data Center revenue** of $4.28 billion - **Second quarter fiscal 2024 revenue outlook** of $11.00 billion GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter. Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, emphasizing NVIDIA's role and advancements in these areas. During th

**Analysis**: The generated response incorrectly states that NVIDIA's revenue decreased this quarter, when in fact the referenced report notes a 19% increase quarter-over-quarter. 

This mismatch between the query and the context leads to a very low `response_groundedness` score of 0.04 and a `trustworthiness` score of just 0.08, indicating that the answer not only fabricates a financial trend but also misleads with confident but false arithmetic.
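For reference, the correct arithmetic can be checked directly from the figures in the retrieved context ($7.19 billion, up 19% from the previous quarter). This is a small sketch with illustrative variable names.

```python
current_revenue = 7.19   # Q1 FY2024 revenue in billions of dollars (from the context)
qoq_growth = 0.19        # "up 19% from the previous quarter"

# current = previous * (1 + growth)  =>  previous = current / (1 + growth)
previous_revenue = current_revenue / (1 + qoq_growth)
change = current_revenue - previous_revenue

print(f"Previous quarter: ${previous_revenue:.2f} billion")  # → Previous quarter: $6.04 billion
print(f"Change: +${change:.2f} billion")                     # → Change: +$1.15 billion
```

So revenue rose by roughly $1.15 billion quarter-over-quarter, which is why any answer reporting a decrease deserves a low score.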

Let's try another one:

In [87]:
get_answer("If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?")

Query:
If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?

Context:
Earnings Per Share | $1.09 | $0.88 | $1.36 | Up 24% | Down 20% | ## Outlook NVIDIA’s outlook for the second quarter of fiscal 2024 includes: - **Revenue**: Expected to be $11.00 billion, plus or minus 2%. - **Gross Margins**: GAAP and non-GAAP gross margins are expected to be 68.6% and 70.0%, respectively, plus or minus 50 basis points. - **Operating Expenses**: GAAP and non-GAAP operating expenses are expected to be approximately $2.71 billion and $1.90 billion, respectively. - **Tax Rates**: GAAP and non-GAAP tax rates are expected to be 14.0%, plus or minus 1%, excluding any discrete items. ## Highlights NVIDIA has made significant progress in various areas since its last earnings announcement: ### Data Center - **First-quarter revenue** was a record $4.28 billion, up 14% from a year ago and up 18% from the 

**Analysis**: The generated response overstates the projected revenue (it sums up the Q1 financials), which leads to a low `trustworthiness` score of 0.27.
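One way to sanity-check such an answer is to work out the projection by hand. The sketch below rests on two assumptions that the source does not confirm: that the truncated "up 18% from the ..." in the retrieved context refers to the previous quarter, and that "projected annual revenue" means the sum of the next four quarters.

```python
q1_data_center = 4.28  # Q1 FY2024 Data Center revenue in billions (from the context)
qoq_growth = 0.18      # Assumption: the 18% figure is quarter-over-quarter growth

# Compound the growth rate for each of the next four quarters
next_four_quarters = [q1_data_center * (1 + qoq_growth) ** k for k in range(1, 5)]
projected_annual = sum(next_four_quarters)

for k, revenue in enumerate(next_four_quarters, start=1):
    print(f"Quarter +{k}: ${revenue:.2f} billion")
print(f"Projected annual revenue: ${projected_annual:.2f} billion")  # → $26.34 billion
```

Under a different reading of the question (for example, including Q1 itself in the four quarters), the total would differ, so treat this as one plausible reference value rather than the single correct answer.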

### Custom Evals

You can also specify custom evaluations to assess specific criteria, and combine them with the default evaluations for comprehensive/tailored assessment of your RAG system.

For instance, here's how to create and run a custom eval that checks the conciseness of the generated response.

In [89]:
conciseness_eval = Eval(
    name="response_conciseness",
    criteria="Evaluate whether the Generated Response is concise and to the point without unnecessary verbosity or repetition. A good response should be brief but comprehensive, covering all necessary information without extra words or redundant explanations.",
    response_identifier="Generated Response"
)

# Combine default evals with a custom eval
combined_evals = get_default_evals() + [conciseness_eval]

# Initialize TrustworthyRAG with combined evals
combined_trustworthy_rag = TrustworthyRAG(evals=combined_evals)

In [101]:
get_answer("What significant transitions did Jensen comment on?", evaluator=combined_trustworthy_rag)

Query:
What significant transitions did Jensen comment on?

Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the previous quarter - **Record Data Center revenue** of $4.28 billion - **Second quarter fiscal 2024 revenue outlook** of $11.00 billion GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter. Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, emphasizing NVIDIA's role and advancements in these areas. During the first quarter of fiscal
Cyb

### Replace your LLM with Cleanlab's

Beyond evaluating responses already generated from your LLM, Cleanlab can also generate responses and evaluate them simultaneously (using one of many [supported models](https://help.cleanlab.ai/tlm/api/python/tlm/#class-tlmoptions)).
This replaces your own LLM within your RAG system and can be more convenient/accurate/faster.

Let's replace the OpenAI call with a single call to Cleanlab's endpoint:

In [102]:
query = "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"

print(f"Query:\n{query}\n")

relevant_chunks = customer_support.query.near_text(
    query=query,
    limit=2,
)

# Create context from the chunks
chunks = [obj.properties['chunk'] for obj in relevant_chunks.objects]
context = create_context(chunks)
print(context)

Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the previous quarter - **Record Data Center revenue** of $4.28 billion - **Second quarter fiscal 2024 revenue outlook** of $11.00 billion GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter. Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, emphasizing NVIDIA's role and advancements in these areas. During th

In [103]:
# Generate the answer using LLM
system_prompt = "You are a helpful AI assistant that answers query based on the context.\n"
user_prompt = f"{context}\nQuery:\n{query}"

result = trustworthy_rag.generate(
    query=query,
    context=context,
    prompt=system_prompt + user_prompt
)

print(f"Generated Response:\n{result['response']}\n")
print("Evaluation Scores:")
for metric, value in result.items():
    if metric != "response":
        print(f"{metric}: {value['score']}")

Generated Response:
NVIDIA's revenue for the first quarter was $7.19 billion, which is up 19% from the previous quarter. To find the revenue for the previous quarter, we can calculate it as follows:

Let \( x \) be the revenue for the previous quarter. According to the information provided:

\[ x + 0.19x = 7.19 \]
\[ 1.19x = 7.19 \]
\[ x = \frac{7.19}{1.19} \]
\[ x \approx 6.04 \text{ billion} \]

Now, to find the decrease in revenue from the previous quarter to this quarter, we can calculate:

\[ 7.19 - 6.04 = 1.15 \text{ billion} \]

Therefore, NVIDIA's revenue did not decrease this quarter compared to the last quarter; instead, it increased by approximately $1.15 billion.

Evaluation Scores:
trustworthiness: 0.7047853317728329
context_sufficiency: 0.9793883585825877
response_groundedness: 0.16683115793235678
response_helpfulness: 0.2743783166467349
query_ease: 0.41290746161713915


While it remains hard to achieve a RAG application that will accurately answer *any* possible question, you can easily use Weaviate and Cleanlab to deploy a *trustworthy* RAG application which at least flags answers that are likely inaccurate.
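As a minimal sketch of what such flagging could look like, the snippet below replaces a response with a fallback message when its trustworthiness score is too low. The `guard_response` helper, the fallback text, and the `0.5` threshold are hypothetical choices for illustration. The input shape (`{metric: {"score": ...}}`) mirrors how the results are read in the cells above, and the two scores (1.0 and 0.08) are the trustworthiness values reported earlier in this tutorial.

```python
FALLBACK = "Sorry, I can't answer that reliably from the available documents."

def guard_response(response, eval_result, min_trust=0.5):
    """Return (text_to_show, was_flagged); min_trust is an assumed, app-specific threshold."""
    trust = eval_result["trustworthiness"]["score"]
    if trust < min_trust:
        return FALLBACK, True
    return response, False

# Trustworthiness scores taken from the easy and misleading examples earlier
easy = guard_response("Revenue was $7.19 billion.", {"trustworthiness": {"score": 1.0}})
misleading = guard_response("Revenue decreased.", {"trustworthiness": {"score": 0.08}})

print(easy)        # → ('Revenue was $7.19 billion.', False)
print(misleading)  # → ("Sorry, I can't answer that reliably from the available documents.", True)
```

Instead of a fixed fallback, a flagged response could also be routed to a human reviewer or shown with a warning, depending on how costly a wrong answer is in your application.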