[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/weaviate-features/model-providers/anthropic/rag_with_citations_anthropic_claude-3-5-sonnet.ipynb)

# RAG with Weaviate and Anthropic's Citations API

Notebook author: Danny Williams @ Weaviate


## Overview

[Anthropic's Citations API](https://www.anthropic.com/news/introducing-citations-api) is a feature of Anthropic's Messages API that lets the LLM cite the specific passages of the supplied documents that support its response.

For each citation it returns the quoted text and its location in the source document, attached to the part of the LLM's response that the passage supports.

This notebook shows how to use the Citations API with Weaviate's vector database retriever and Anthropic's LLM.
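
To make the shape of a citation concrete, here is a minimal sketch of the kind of record returned for a plain-text document. The field names follow the `char_location` citation type in Anthropic's documentation; the sample document, the values, and the helper `citation_matches_document` are made up for illustration, and treating `end_char_index` as exclusive is an assumption worth checking against the current API reference.

```python
# Hypothetical sample document and citation, for illustration only.
document_text = "Machine learning is a field of study in artificial intelligence."

sample_citation = {
    "type": "char_location",
    "cited_text": "Machine learning",
    "document_index": 0,          # position of the document in the request
    "document_title": "Machine learning",
    "start_char_index": 0,
    "end_char_index": 16,         # assumed exclusive, like a Python slice
}

def citation_matches_document(citation, text):
    """Check that the cited text is the slice of the document it points at."""
    start = citation["start_char_index"]
    end = citation["end_char_index"]
    return text[start:end] == citation["cited_text"]

print(citation_matches_document(sample_citation, document_text))
```

A check like this can be useful when you want to link a citation back to its exact position in the stored document.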


## Data (ML Wikipedia Articles)

To run this section, install the `wikipedia` package:
```
pip install wikipedia
```

This section searches Wikipedia for a term, takes the top search result, and scrapes that page along with every page it links to. A search for "machine learning" therefore returns the "Machine learning" page plus all the pages linked from it. This creates a large number of documents that serve as the knowledge base for the RAG pipeline.


In [1]:
import wikipedia
from wikipedia import WikipediaPage

def get_page(title):
    # Return None for ambiguous or missing titles so the caller can skip them
    try:
        return WikipediaPage(title)
    except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
        return None

def get_page_contents(page):
    words = page.content.split()
    content = ' '.join(words[:4000]) # limit to 4000 words so we don't hit the vectorizer token limit
    return {
        "title": page.title.replace('/', '_'),
        "content": content,
        "categories": page.categories
    }

def scrape_wikipedia_page(page_title):
    # Take the top search result, then follow every link on that page
    search_results = wikipedia.search(page_title, results=1)
    if not search_results:
        return []
    root = get_page(search_results[0])
    if root is None:
        return []
    data = [get_page_contents(root)]

    for link in root.links:
        page = get_page(link)
        if page is not None:
            data.append(get_page_contents(page))
    return data
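
The scraper above follows every link on the page, which can produce hundreds of pages. A minimal sketch of one way to bound the crawl while testing is shown below. `collect_pages`, `fetch`, and `max_pages` are hypothetical names, not part of the `wikipedia` package; `fetch` stands in for a function like `get_page` that returns `None` for titles it cannot resolve.

```python
def collect_pages(titles, fetch, max_pages=50):
    """Fetch pages for unique titles, skipping failures, up to max_pages."""
    seen = set()
    pages = []
    for title in titles:
        if len(pages) >= max_pages:
            break
        if title in seen:
            continue
        seen.add(title)
        page = fetch(title)
        if page is not None:
            pages.append(page)
    return pages

# Stand-in for a real lookup: a dict mapping titles to page contents.
fake_wiki = {"A": "page A", "B": "page B", "C": "page C"}
pages = collect_pages(["A", "A", "missing", "B", "C"], fake_wiki.get, max_pages=2)
print(pages)
```

Passing the lookup function in as an argument keeps the helper testable without network access.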

Now we add the data to Weaviate. We use the `text2vec-openai` vectorizer to embed the data, but the choice of vectorizer does not matter for the Citations API; the embeddings are only used by the vector database retriever to search.

In [2]:
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Property, DataType, Configure

import os
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=Auth.api_key(os.environ.get("WCD_API_KEY")),
    headers = {"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")} # use an openai vectorizer - just for the embeddings (not the generative/LLM part)
)

# Create a new collection
collection = client.collections.create(
    "wikipedia",
    properties=[
        Property(name="title", data_type=DataType.TEXT, description="The title of the Wikipedia page"),
        Property(name="content", data_type=DataType.TEXT, description="The content of the Wikipedia page"),
        Property(name="categories", data_type=DataType.TEXT_ARRAY, description="The categories of the Wikipedia page")
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large",
        dimensions=256
    )
)
data = scrape_wikipedia_page("machine learning")

count = 0
for i, item in enumerate(data):
    try:
        collection.data.insert(
            properties=item,
        )
    except Exception as e:
        count += 1
        print(f"Error inserting item {i}. Total missing items: {count}/{len(data)}")


Error inserting item 182. Total missing items: 1/882
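
Inserting objects one at a time makes one request per item. As a hedged alternative, the sketch below uses the batch interface of the Weaviate Python client v4 (`collection.batch.dynamic()` and `collection.batch.failed_objects`). The function name `insert_in_batch` is hypothetical, and the block only defines the function; it assumes it will be called with the `collection` and `data` objects created above.

```python
def insert_in_batch(collection, items):
    """Add items through the client's dynamic batcher and return the failure count."""
    with collection.batch.dynamic() as batch:
        for item in items:
            batch.add_object(properties=item)
    # Objects that could not be imported are collected on the collection's batch handle
    return len(collection.batch.failed_objects)
```

It would be called as `failed = insert_in_batch(collection, data)`, after which `failed` plays the role of `count` in the loop above.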


# Optional cleanup: delete the collection to start over.
# This reuses the `client` created above. Do not run it before the RAG steps below,
# because they query the "wikipedia" collection.
client.collections.delete("wikipedia")


## RAG with Weaviate and Anthropic

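The functions below implement the three stages of the RAG pipeline: retrieving documents from Weaviate, formatting them as citable documents for Anthropic's API, and generating the response.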

In [3]:

# Get the Weaviate collection we just made
collection = client.collections.get("wikipedia")

# Use Weaviate's hybrid search to get the most relevant documents for the query
def query_wikipedia(query):
    response = collection.query.hybrid(query, limit=5)
    return [obj.properties for obj in response.objects]

# Build the user message in the format the Citations API expects:
# one document block per retrieved object, each with citations enabled
def create_message_with_documents(prompt, objects):
    message = [{
        "role": "user",
        "content": []
    }]
    for obj in objects:
        message[0]["content"].append({
            "type": "document",
            "source": {
                "type": "text",
                "media_type": "text/plain",
                "data": obj["content"]
            },
            "title": obj["title"],
            "context": str(obj["categories"]),
            "citations": {"enabled": True}
        })

    message[0]["content"].append({
        "type": "text",
        "text": prompt
    })

    return message

# Run the query, and generate the response from Anthropic's LLM
def generate_response(anthropic_client, prompt):
    objects = query_wikipedia(prompt)
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=create_message_with_documents(prompt, objects)
    )
    return response
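
For reference, this is a minimal sketch of the single document block that `create_message_with_documents` appends for each retrieved object, built from a made-up object so the structure can be inspected without calling either service. The helper name `build_document_block` and the sample values are hypothetical; the keys mirror the ones used in the function above.

```python
import json

def build_document_block(obj):
    """Build one citable plain-text document block from a retrieved object."""
    return {
        "type": "document",
        "source": {
            "type": "text",
            "media_type": "text/plain",
            "data": obj["content"],
        },
        "title": obj["title"],
        "context": str(obj["categories"]),
        "citations": {"enabled": True},
    }

# Made-up object standing in for one Weaviate search result.
sample = {
    "title": "Example page",
    "content": "Example content.",
    "categories": ["Example category"],
}
block = build_document_block(sample)
print(json.dumps(block, indent=2))
```

As far as Anthropic's documentation describes it, `title` and `context` are passed to the model as metadata, while only the text in `data` is citable.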

## Run the RAG pipeline

This section uses the `anthropic` library to call Anthropic's LLM. Install it with:
```
pip install anthropic
```



In [4]:
import anthropic
anthropic_client = anthropic.Anthropic()  # reads the API key from the ANTHROPIC_API_KEY environment variable

response = generate_response(anthropic_client, "Explain all the different types of machine learning models: regression, classification, clustering, language models and image modelling")
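
The response is returned as a list of content blocks rather than a single string. A minimal sketch of joining them into plain text and counting citations follows; `answer_text` and `count_citations` are hypothetical helper names, and the `SimpleNamespace` objects only imitate the `text` and `citations` attributes that the formatting code later in this notebook relies on.

```python
from types import SimpleNamespace

def answer_text(content_blocks):
    """Concatenate the text of every content block into one string."""
    return "".join(block.text for block in content_blocks)

def count_citations(content_blocks):
    """Count citations across blocks; blocks without citations contribute zero."""
    return sum(len(block.citations or []) for block in content_blocks)

# Stand-in blocks imitating the attributes used in this notebook.
fake_content = [
    SimpleNamespace(text="Intro. ", citations=None),
    SimpleNamespace(text="A cited claim.", citations=[SimpleNamespace(cited_text="source")]),
]
print(answer_text(fake_content))
print(count_citations(fake_content))
```

With a real response, the same helpers would be called as `answer_text(response.content)` and `count_citations(response.content)`.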

## Format the response

We use the `textwrap3` library to wrap the response and the citations to a fixed width.
```
pip install textwrap3
```
The code below also colour-codes the citations with ANSI escape codes: the cited parts of the LLM's response are highlighted, and the references listed underneath use the same colour as the text they support.
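
As a minimal sketch of how the colouring works: an ANSI escape code such as `\033[93m` switches the terminal's text colour, and `\033[0m` resets it. The helper names `colourise` and `colour_for_document`, and the palette, are assumptions for illustration.

```python
RESET = "\033[0m"
PALETTE = ["\033[93m", "\033[91m", "\033[92m", "\033[95m", "\033[94m"]

def colour_for_document(document_index):
    """Pick a colour per document, wrapping around if there are more documents than colours."""
    return PALETTE[document_index % len(PALETTE)]

def colourise(text, colour):
    """Wrap text in a colour code and reset afterwards so later text is unaffected."""
    return f"{colour}{text}{RESET}"

print(colourise("cited sentence", colour_for_document(0)), "uncited sentence")
```

Terminals and notebook output cells that do not interpret ANSI codes will show the raw codes instead of colours.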

In [5]:
import textwrap3

def wrap_text_preserve_breaks(text, width):
    lines = text.split('\n')
    wrapped_lines = [textwrap3.fill(line, width=width) for line in lines]
    return '\n'.join(wrapped_lines)


citations = []
printed_text = ""
# One distinct ANSI colour per retrieved document (the query returns 5 documents)
citation_colours = ['\033[93m', '\033[91m', '\033[92m', '\033[95m', '\033[94m']
for block in response.content:
    if block.citations:  # skip blocks whose citations are None or an empty list
        for citation in block.citations:
            colour = citation_colours[citation.document_index % len(citation_colours)]
            citations.append({
                "text": citation.cited_text,
                "title": citation.document_title,
                "colour": colour
            })

        # Colour the block using the document of its last citation
        printed_text += f"{colour}{block.text}"
    else:
        printed_text += f"\033[0m{block.text} "

cited_text = "REFERENCES\n__________\n\n"
for i, citation in enumerate(citations):
    cited_text += f"\033[0m(Source {i+1}) {citation['colour']}{citation['title']}:\n"
    cited_text += citation['text'] + "\n\n"


print(wrap_text_preserve_breaks(printed_text, width=90), end="\n\n")
print(wrap_text_preserve_breaks(cited_text, width=90))


Based on the provided documents, I'll explain the different types of machine learning
models:

1. Classification Models:
Classification is performed by computers using statistical methods. Individual
observations are analyzed using a set of quantifiable properties called explanatory
variables or features.

Key aspects of classification:
- Probabilistic classification is a common subclass that:
- Uses statistical inference to find the best class
- Outputs probabilities for each possible class
- Usually selects the class with highest probability
- Can output confidence values
- Can abstain when confidence is too low

2. Types of Classification:
Classification can be:
- Binary classification: involves only two classes
- Multiclass classification: involves assigning an object to one of several classes
- Many classification methods are specifically designed for binary classification and
require combination for multiclass problems

3. Clustering Models:
 [9