<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `Introduction to Retrieval Augmented Generation (RAG)` `2`

This is lesson `2` of 3 in the educational series on `Introduction to Retrieval Augmented Generation (RAG)`. This notebook is intended `to teach the basic concepts of RAG` and how to build a RAG system in Python.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* spaCy
* Vector databases
* Semantic search
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Explanation` 


**Difficulty:** `Intermediate`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Learn about the foundational concepts of how we represent data numerically, specifically textual data.
2. Learn about TF-IDF representations of texts.
3. Learn about the core concept behind vectors, or embeddings.
4. Learn how to vectorize texts with Python.
5. Learn about the importance of transformer models
```
___

# Required Python Libraries
* spaCy for processing texts
* srsly for loading JSON data
* pandas for working with tabular data
* weaviate-client for connecting to and querying a Weaviate vector database



## Install Required Libraries

In [5]:
### Install Libraries ###

# Using !pip installs
!pip install weaviate-client spacy pandas srsly

Collecting weaviate-client
  Using cached weaviate_client-4.7.1-py3-none-any.whl.metadata (3.3 kB)
Collecting validators==0.33.0 (from weaviate-client)
  Using cached validators-0.33.0-py3-none-any.whl.metadata (3.8 kB)
Collecting authlib<2.0.0,>=1.2.1 (from weaviate-client)
  Using cached Authlib-1.3.1-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting grpcio<2.0.0,>=1.57.0 (from weaviate-client)
  Using cached grpcio-1.65.1-cp310-cp310-macosx_12_0_universal2.whl.metadata (3.3 kB)
Collecting grpcio-tools<2.0.0,>=1.57.0 (from weaviate-client)
  Using cached grpcio_tools-1.65.1-cp310-cp310-macosx_12_0_universal2.whl.metadata (5.3 kB)
Collecting grpcio-health-checking<2.0.0,>=1.57.0 (from weaviate-client)
  Using cached grpcio_health_checking-1.65.1-py3-none-any.whl.metadata (1.1 kB)
Collecting cryptography (from authlib<2.0.0,>=1.2.1->weaviate-client)
  Using cached cryptography-43.0.0-cp39-abi3-macosx_10_9_universal2.whl.metadata (5.4 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio

In addition to this, we will download the spaCy model `en_core_web_sm`, which the sentence-based chunking example later in this notebook loads.

In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-md==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
import os

os.environ["OPENAI_API_KEY"] = ""

In [56]:
### Import Libraries ###
import spacy
import srsly
from weaviate.classes.init import Auth
import weaviate.classes as wvc
import weaviate

import pandas as pd

# Introduction

In this notebook, we will learn about RAG specifically. We will learn how the technologies of LLMs and Vector Databases can be brought together to build a RAG system. We will also learn how to build a naive RAG system in Python.

# Retrieval Augmented Generation

Retrieval augmented generation (RAG) brings together vector databases and LLMs to leverage the strengths of each while, in theory, helping mitigate their weaknesses. Let's learn a bit about how this technology works.

## How does RAG Work?

Behind RAG sits a vector database. This is used to store knowledge that can be used to generate better responses from an LLM. When a user queries a vector database with RAG, they don't explicitly perform semantic search, as they normally would with a vector database. Instead, they frame a natural question, just like they would with an LLM. A RAG application then goes into the vector database and finds the sources that most closely align with the specific query. The RAG system then gathers this material and provides it to an LLM along with the user's initial question. These retrieved documents supply the context that guides the LLM to generate a more precise and specific response.

The RAG system then receives the output from the LLM and provides the response to the user. Because the LLM's response is rooted in context explicitly provided to it, the RAG system knows the original source material that framed the response. This means that the RAG system can help the user validate the response from the LLM by showing the user the specific sections of each document that were used to generate it.
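The flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how Weaviate or any particular library implements it: the function names `embed`, `cosine`, `retrieve`, `build_prompt`, and `rag` are hypothetical, the bag-of-words vector is a toy stand-in for a real embedding model, and `llm` is any callable that takes a prompt string and returns text.

```python
from collections import Counter
import math


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, limit=1):
    # Rank the stored documents by similarity to the question.
    query_vector = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:limit]


def build_prompt(question, sources):
    # Place the retrieved sources in front of the user's question.
    context = "\n\n".join(sources)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def rag(question, documents, llm, limit=1):
    sources = retrieve(question, documents, limit)
    answer = llm(build_prompt(question, sources))
    # Return the sources with the answer so the user can verify it.
    return answer, sources
```

Returning the sources together with the answer is what lets a user check the generated text against the original material.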


## Strengths of RAG

The key strength of a RAG system is the ability to provide LLMs with a new knowledge base. This is often knowledge that they were not trained on or knowledge that was out-of-scope.

In addition to this, a RAG system can limit the chance of hallucinations. By explicitly providing the context needed to answer a user's question, it reduces the LLM's chances of hallucinating. It is important to note, though, that those chances are not eliminated.

Finally, a RAG system allows a user to understand specifically *why* an LLM generates a response by providing the user with the original source material used to frame the response. Thus, a user not only gets a response from an LLM, they also can navigate the original data and find relevant material more easily.

## Weaknesses of RAG

While RAG systems seem like the perfect answer to the limitations of LLMs, they must be used cautiously. RAG is still a fairly new technology and current research suggests that the chances for hallucinations have not entirely disappeared. Even if a user has access to the original source material, what if they opt to not view it? If a hallucination is generated by an LLM on your system and you function as an authority, does that make the LLM's response an authority? These are the questions that make it challenging to put RAG systems into production, especially when working with cultural heritage data.




# Loading the Data

To create a vector database, you must start with data. We will be working with data from [Founders Online](https://founders.archives.gov/). This data is available in the `../data/processed/` folder. I have provided samples of 10, 100, and 1,000 documents. For this notebook, we will be working with the sample of 1,000. It is important to note that I have seeded the random sample, so the data you work with will be the same each time. To get different data, change the seed in the `download_data.py` script, located in `./src/data/`.

This data is a collection of writings of the Founders. The data is useful for doing social network analysis as many of the writings are letters. It's also useful for mapping writings across time and space as many of the writings are dated and contain information about specific locations. For our purposes, however, we will be working with the main content of the letters to create a vector database. 

To get started, let's load the data. We will be using `srsly`, which I'm including in this tutorial as a way to introduce students to the library. You can also use the standard `json` package here. `srsly` bundles JSON readers and writers behind one-line helpers, and its `read_jsonl` function yields records one at a time as a generator. That is useful when you start working with larger datasets (as you typically do with vector databases) because the entire dataset is not loaded into memory at once. Our sample is a single JSON file, so `read_json` returns its parsed contents; we wrap the call in `list()` to make sure we end up with a plain list that is easy to index.


In [15]:
data = list(srsly.read_json("../data/processed/sample_1000_42.json"))
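For readers who prefer the standard library, the same file could be loaded with `json`. This is a minimal sketch that assumes the file holds a single JSON array of records; `load_records` is a hypothetical helper name, not part of the original notebook.

```python
import json
from pathlib import Path


def load_records(path):
    # Read the whole file and parse it with the standard library.
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    # Assumption: the file stores a JSON array, one object per document.
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records
```

With the sample used here, the call would be `data = load_records("../data/processed/sample_1000_42.json")`.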

Now that we have loaded up our data, let's take a brief look at it by examining the first item.

In [16]:
data[0]

{'title': 'Thomas Jefferson to Joseph Milligan, 22 December 1815',
 'permalink': 'https://founders.archives.gov/documents/Jefferson/03-09-02-0174',
 'project': 'Jefferson Papers',
 'authors': ['Jefferson, Thomas'],
 'recipients': ['Milligan, Joseph'],
 'date-from': '1815-12-22',
 'date-to': '1815-12-22',
 'content': 'Monticello Dec. 22. 15.\nDear Sir\nOn my return here from Bedford a few days ago, I found the Hutton and Requisite tables, bound to my mind. by this mail I send you an Ovid’s metamorphoses almost entirely worne out & defaced, yet of sovaluable and rareaneditionthat I wish you to put it into as good a state of repair as it is susceptible of. by the next mail I will forward a Cornelius Nepos to be bound. be so good as to procure and forward to me by stage the underwritten books.I salute you with friendship & esteem\nTh: Jefferson\nAinsworth’sLat. & Eng. dict. abridged. to be bound[. . .]\nthe Lat. & Eng in one, & the Eng. & Lat.[. . .]\nOvid’s metamorphoses. the Delphin edn 

Notice that we have some important metadata here including the title, permalink (link to the website where this particular entry appears), project, authors, recipients, date-from, date-to, and content. Everything here is as presented in the original metadata.json file with the exception of `content`. I have added this after pulling the data from the website. We will learn about how we can make these extra attributes more useful in the next notebook. For now, let's focus on the `content` attribute.

This gives us a better sense of this letter. As we can see this is a raw-string representation of the data. The odd formatting is due to how the data is rendered on the main page, likely to capture the structure of the original document. If you want to verify what the original document looks like, use the link below. Here is what it looks like as of the writing of this notebook.

<img src="../assets/founders.png" alt="Founders image" style="max-height: 500px; width: auto;">

Let's load the data into a Pandas DataFrame so that it will be a bit easier to analyze.

In [17]:
df = pd.DataFrame(data)
df

Unnamed: 0,title,permalink,project,authors,recipients,date-from,date-to,content
0,"Thomas Jefferson to Joseph Milligan, 22 Decemb...",https://founders.archives.gov/documents/Jeffer...,Jefferson Papers,"[Jefferson, Thomas]","[Milligan, Joseph]",1815-12-22,1815-12-22,Monticello Dec. 22. 15.\nDear Sir\nOn my retur...
1,"To Alexander Hamilton from James McHenry, 3 Ma...",https://founders.archives.gov/documents/Hamilt...,Hamilton Papers,"[McHenry, James]","[Hamilton, Alexander]",1791-05-03,1791-05-03,[Baltimore] 3 May 1791.\nMy dear Sir.\nI did n...
2,John Adams to John Quincy Adams and Thomas Boy...,https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]","[Adams, John Quincy, Adams, Thomas Boylston]",1794-09-14,1794-09-14,Quincy Septr.14. 1794\nMy dear Sons\nI once mo...
3,From George Washington to Major General Horati...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[Gates, Horatio]",1776-12-23,1776-12-23,"Head Quarters [Bucks County, Pa.] 23d Decr 177..."
4,[Diary entry: 5 July 1795],https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]",[],1795-07-05,1795-07-05,Could not find the main content
...,...,...,...,...,...,...,...,...
995,"From John Adams to Boston Patriot, 4 November ...",https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[Boston Patriot],1809-11-04,1809-11-04,"Quincy, November 4, 1809.\nSirs,\nIn my last l..."
996,"From John Adams to United States Senate, 14 Ma...",https://founders.archives.gov/documents/Adams/...,Adams Papers,"[Adams, John]",[United States Senate],1798-03-14,1798-03-14,United States March 14th 1798:\nGentlemen of t...
997,"To Benjamin Franklin from William Henly, [Apri...",https://founders.archives.gov/documents/Frankl...,Franklin Papers,"[Henly, William]","[Franklin, Benjamin]",1772-04-01,1772-04-30,"Sunday Eve. [April?, 1772]\nDear Sir:\nI have ..."
998,From George Washington to Major General Alexan...,https://founders.archives.gov/documents/Washin...,Washington Papers,"[Washington, George]","[McDougall, Alexander]",1779-05-20,1779-05-20,Head Quarters Middle Brook May 20th 1779\nDr S...


# Importance of Metadata

Metadata functions like a set of features for each document. It gives us an entry point for accessing that document. In other words, metadata makes a piece of data more discoverable. Think of a library. Books have metadata which has been carefully cultivated by librarians. This metadata has numerous features. Consider the example below.

| Title | Author | ISBN | Publication Year | Genre |
|-------|--------|------|------------------|-------|
| To Kill a Mockingbird | Harper Lee | 978-0446310789 | 1960 | Classic Fiction |
| 1984 | George Orwell | 978-0451524935 | 1949 | Dystopian Fiction |
| The Great Gatsby | F. Scott Fitzgerald | 978-0743273565 | 1925 | Literary Fiction |
| Pride and Prejudice | Jane Austen | 978-0141439518 | 1813 | Romance |
| The Catcher in the Rye | J.D. Salinger | 978-0316769174 | 1951 | Coming-of-age Fiction |
| The Hobbit | J.R.R. Tolkien | 978-0547928227 | 1937 | Fantasy |
| The Da Vinci Code | Dan Brown | 978-0307474278 | 2003 | Thriller |
| The Hunger Games | Suzanne Collins | 978-0439023528 | 2008 | Young Adult Dystopian |
| The Girl with the Dragon Tattoo | Stieg Larsson | 978-0307454546 | 2005 | Crime Fiction |
| The Alchemist | Paulo Coelho | 978-0062315007 | 1988 | Philosophical Fiction |

In this example, we have very important metadata that makes each book more accessible to users of a library catalog. We have Title, Author, ISBN, Publication Year, and Genre. If a user wanted to find only Classic Fiction written before 1970, they could filter the data by those features and return the necessary results.
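As a small illustration of that kind of filter, the sketch below stores a few rows of the table as dictionaries and keeps only the books matching a genre and a cutoff year. The `filter_books` helper and the shortened field names (`year`, `genre`) are assumptions made for this example.

```python
# A few rows from the catalog table above, stored as plain dictionaries.
books = [
    {"title": "To Kill a Mockingbird", "author": "Harper Lee", "year": 1960, "genre": "Classic Fiction"},
    {"title": "1984", "author": "George Orwell", "year": 1949, "genre": "Dystopian Fiction"},
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "year": 1925, "genre": "Literary Fiction"},
    {"title": "The Hobbit", "author": "J.R.R. Tolkien", "year": 1937, "genre": "Fantasy"},
]


def filter_books(records, genre, before_year):
    # Keep only the records whose metadata matches both conditions.
    return [b for b in records if b["genre"] == genre and b["year"] < before_year]


classics = filter_books(books, genre="Classic Fiction", before_year=1970)
print([b["title"] for b in classics])  # ['To Kill a Mockingbird']
```

The same idea carries over to a vector database, where metadata filters narrow the set of documents before or alongside a semantic query.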

In this notebook and the following one, we will learn the role of metadata in vector databases and RAG systems so that we can retrieve relevant material both by query and by filters.

# Chunking Data

Before we can begin creating a RAG system, though, we need to manipulate our data. One of the key issues with LLMs is that they have a limited context window, or quantity of tokens (think of these as words for now) that they can ingest. The context window is filled by both the input text and the output from the LLM. This means that we cannot simply give smaller LLMs an entire book and be able to get a specific output. In order to resolve this issue, we often need to chunk our data.
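Because input and output share one window, a quick budget check can tell us whether a text needs chunking at all. The sketch below is a rough illustration only: it counts whitespace-separated words as tokens (a simplification), and the function names and the 4,096-token window in the usage note are hypothetical.

```python
def rough_token_count(text):
    # Crude approximation: one token per whitespace-separated word.
    return len(text.split())


def fits_context(prompt_tokens, max_output_tokens, context_window):
    # Input and output share the same window, so both must fit together.
    return prompt_tokens + max_output_tokens <= context_window
```

For example, `fits_context(rough_token_count(letter), 500, 4096)` would report whether a letter plus a 500-token answer fits in an assumed 4,096-token window. A real tokenizer usually produces more tokens than there are words, so treat the word count as an optimistic estimate.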

Chunking is when we take an input text and turn it into a sequence of smaller components. When chunking, we have a lot of methods to choose from.

## Character-Based Chunking

Character-based chunking involves splitting the text into chunks based on a fixed number of characters. This method is simple but can be problematic as it may split words or sentences in unnatural places.

Pros:
- Simple to implement
- Consistent chunk sizes

Cons:
- May split words or sentences inappropriately
- Doesn't consider semantic meaning

Example:
```python
def character_chunk(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
```

## Whitespace-Based Chunking

Whitespace-based chunking splits text at whitespace characters (spaces, tabs, newlines) to create chunks. This method preserves whole words but may result in varying chunk sizes.

Pros:
- Preserves whole words
- Simple to implement

Cons:
- Can result in inconsistent chunk sizes
- May split sentences or paragraphs

Example:
```python
def whitespace_chunk(text, max_words=200):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]
```

## Token-Based Chunking

Token-based chunking uses a tokenizer to split text into tokens (which can be words, subwords, or characters depending on the tokenizer) and then groups these tokens into chunks of a specified size.

Pros:
- Aligns well with LLM token limits
- Can be more precise than character or whitespace chunking

Cons:
- Requires a tokenizer
- May still split sentences inappropriately

Example:
```python
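from transformers import AutoTokenizer

# Load the tokenizer once instead of on every call.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def token_chunk(text, max_tokens=512):
    # Encode to token IDs, slice into fixed-size windows, and decode each window back to text.
    tokens = tokenizer.encode(text)
    return [tokenizer.decode(tokens[i:i+max_tokens]) for i in range(0, len(tokens), max_tokens)]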
```

## Sentence-Based Chunking

Sentence-based chunking splits text into individual sentences and then groups these sentences into chunks. This method preserves sentence integrity but may result in varying chunk sizes.

Pros:
- Preserves sentence meaning
- More semantically coherent chunks

Cons:
- Can result in inconsistent chunk sizes
- Requires sentence tokenization

Example (using spaCy):
```python
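import spacy

# Load the pipeline once; reloading it on every call is slow.
nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text, max_sentences=10):
    # Let spaCy find sentence boundaries, then group the sentences into chunks.
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    return [' '.join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]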
```

## Paragraph-Based Chunking

Paragraph-based chunking splits text into paragraphs and then groups these paragraphs into chunks. This method preserves paragraph integrity and is useful for maintaining context.

Pros:
- Preserves paragraph context
- Often aligns with natural text structure

Cons:
- Can result in very inconsistent chunk sizes
- May produce overly large chunks for long paragraphs

Example:
```python
def paragraph_chunk(text, max_paragraphs=5):
    paragraphs = text.split('\n\n')
    return ['\n\n'.join(paragraphs[i:i+max_paragraphs]) for i in range(0, len(paragraphs), max_paragraphs)]
```

Each chunking method has its own advantages and disadvantages. The choice of method often depends on the specific requirements of your RAG system, the nature of your text data, and the capabilities of your LLM. It's common to experiment with different chunking methods to find the one that works best for your particular use case.

## Semantic-Based Chunking

Semantic-based chunking starts with sentence-based chunking. Next, it looks for sentences that are similar in meaning and merges them into a single chunk. The goal of this approach is to have chunks that are distinct in meaning across an entire document.

Pros:
- Keeps chunks highly relevant
- Allows you to retrieve fewer results while still providing the context needed to answer a question

Cons:
- Could be difficult to understand broader context of parts of a chunk
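
Example (a simplified sketch): the version below only compares each sentence with the one before it and starts a new chunk when their similarity drops below a threshold. It is one simple variant of the idea, not a reference implementation. The names `toy_embed`, `cosine`, and `semantic_chunk`, the bag-of-words vectors, and the `0.5` threshold are all assumptions for illustration; in practice you would pass a real sentence-embedding function as `embed`.

```python
from collections import Counter
import math


def toy_embed(sentence):
    # Toy stand-in for a sentence-embedding model: bag-of-words counts.
    return Counter(sentence.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunk(sentences, embed=toy_embed, threshold=0.5):
    chunks = []
    current = []
    previous_vector = None
    for sentence in sentences:
        vector = embed(sentence)
        # Start a new chunk when this sentence differs too much from the last one.
        if current and cosine(previous_vector, vector) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        previous_vector = vector
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The input is a list of sentences, so this function can be fed the sentences produced by a sentence splitter such as spaCy's `doc.sents`.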

# Framing a Historical Question

For this exercise, we will be working as historians researching Abigail Smith Adams. We are interested in knowing about her relationship with her sisters. This will eventually be worked into a book about sororal relationships in the 18th and 19th centuries. We aren't experts on Abigail Adams, but someone mentioned to us that she references her sisters frequently in her letters and that there's a particular letter that she wrote to her daughter that speaks about the moment she first learned of her sister's passing. We also learned that this particular letter is available somewhere on Founders Online.

Your goal for this exercise is to find that particular letter by using Founders Online. If you like, you can also use the DataFrame, `df`, or the `data` object we have already loaded. The correct letter is among these 1,000 documents from Founders Online.

Scroll down to the end of this notebook for a potential solution using Python without RAG.

# Using an LLM in Isolation

For this next exercise, use whatever LLM service you wish, such as ChatGPT. Try to find the answer to this specific question. What did you notice about the responses?

# Querying a RAG System

This particular source is probably difficult to find, and this is not a critique of Founders Online. We have the ability to filter based on metadata, so we can narrow the results to the letters of Abigail Adams. We can even search across those letters with keywords, but we will have numerous hits. We could realistically spend the next 20 minutes or maybe even a few hours trying to find this one specific letter. Another approach, though, is to leverage machine learning, specifically RAG.

## Weaviate

For this example, we will be working with Weaviate. Weaviate is an open-source vector database that is also offered as a cloud service. You can host your own server locally or subscribe to their services and pay a monthly fee to host a Weaviate cluster remotely on their servers. Either way, the approach is precisely the same once the server is up and running.

We will learn how to populate a Weaviate server with data in the next notebook. For now, I have loaded a sample of 300 documents that have already been chunked for us.

To access a Weaviate server, you need to do so via an API. In Python, we can do this with the `weaviate-client` library. This package handles everything from populating a server with data to querying the server for vector similarity or, in our case, RAG.

Because we will be doing everything in the Cloud, we need to have an OpenAI API key. If you haven't done so already, please make an OpenAI account, create an API key, and place it in the environment using the cell near the top of this notebook. Remember, each time you query with the code below, you will be charged. These costs will be minimal, but you should be aware of them.

Once we have done that, we can query the vector database and generate an LLM response.

The first thing that we will need to do is define our Weaviate URL and API key. I have prepared both of these for you. Warning: this instance will be deprecated after class, so these two values will need to be updated with a new server and a new API key before this notebook can be run again. You'll learn about making these in the next notebook.

This API key is read-only.

In [37]:
WEAVIATE_URL = "https://ab8fkdlqrxahroit5qsbma.c0.us-west3.gcp.weaviate.cloud"
WEAVIATE_API_KEY  = "4Q9nn8ytsr9kZnMbigfVxgYwIUfuxl3hFiFU"

Once we have these loaded into memory, we can access the Weaviate server. To do that, we will use the following sample of code. It connects to a Weaviate server by specifying the server we want to connect to, providing our credentials (the Weaviate API key), and passing an extra header that carries our API key for OpenAI.

In [38]:
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WEAVIATE_URL,
    auth_credentials=Auth.api_key(WEAVIATE_API_KEY),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

Once we are connected to the server, we will be able to query it, but first, we need to access a specific collection on this server. In Weaviate, you store vectors inside of collections. This allows you to store vectors for multiple projects on the same server. For this project, we will work with a collection I made called `Founders`. Let's access that collection now.

In [40]:
founders = client.collections.get("Founders")

            Please make sure to close the connection using `client.close()`.


Once connected to the server, we can inspect it. A good way to do this is with `aggregate.over_all()`. We can specify with a Boolean that we want to see the total count.

In [42]:
founders.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=300)

This tells us that we have 300 entries in the collection. Let's go ahead now and query the cluster. We will pose a natural language question. We can do this with the `generate.near_text()` method on the `founders` collection. Here, we will pass only two arguments: the query, or the question we want answered, and the limit. Setting the limit to 1 means only the single closest result is returned.

In [43]:
response = founders.generate.near_text(
    query="What does Abigail Adams say about her sister's death?",
    limit=1
)
response

GenerativeReturn(objects=[GenerativeObject(uuid=_WeaviateUUIDInt('67ac59c7-1f66-48f6-b9d8-96636887fa12'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'chapter_title': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815', 'chunk_index': 0, 'chunk': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815\nQuincy April 14th 1815\nMy dear Daughter\nI address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of decembe

A response contains numerous pieces of data. Let's grab the `objects` to get access to the key parts of the response we need.

In [44]:
response.objects

[GenerativeObject(uuid=_WeaviateUUIDInt('67ac59c7-1f66-48f6-b9d8-96636887fa12'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'chapter_title': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815', 'chunk_index': 0, 'chunk': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815\nQuincy April 14th 1815\nMy dear Daughter\nI address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of december, no letter has reached 

Because we set our limit to 1, let's just grab the first instance.

In [45]:
response.objects[0]

GenerativeObject(uuid=_WeaviateUUIDInt('67ac59c7-1f66-48f6-b9d8-96636887fa12'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'chapter_title': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815', 'chunk_index': 0, 'chunk': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815\nQuincy April 14th 1815\nMy dear Daughter\nI address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of december, no letter has reached u

Now, inside this instance, we want to grab the original data. To do this, we can access properties.

In [46]:
response.objects[0].properties

{'chapter_title': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815',
 'chunk_index': 0,
 'chunk': 'From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815\nQuincy April 14th 1815\nMy dear Daughter\nI address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of december, no letter has reached us from him. We have in compliance with his request, prepared the Children to meet you. \nI need not Say, how painfull there seperation from me will be. Age, infirmities, and many recent afflictions, which I have met, in the Death of many near and dear Friends

This is the letter we were looking for; it is the same one that the keyword-based solution at the end of this notebook reports as Document 10. Let's print off the chunk to get a better look at our result.

In [48]:
print(response.objects[0].properties["chunk"])

From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815
Quincy April 14th 1815
My dear Daughter
I address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of december, no letter has reached us from him. We have in compliance with his request, prepared the Children to meet you. 
I need not Say, how painfull there seperation from me will be. Age, infirmities, and many recent afflictions, which I have met, in the Death of many near and dear Friends, and relatives, have broken me down, and give me little reason, to boast myself of tomorrow, as I know not, what a day may bring

And this document is the precise document that we wanted to find. But this really isn't RAG; it is just a vector-based search. What if we wanted to return a natural language response? To do that, we can pass one other argument: `grouped_task`. This is a prompt that the LLM carries out once over the whole group of retrieved results (here a single result, because the limit is 1).

In [55]:
response = founders.generate.near_text(
    query="What does Abigail Adams say about her sister's death?",
    limit=1,
    grouped_task="State what Abigail Adams says specifically about her sister's death. Provide as much detail as possible."
)

print(response.generated)

Abigail Adams specifically says that the sudden death of her dear and only sister the previous week has opened every wound afresh and caused her tears to flow anew. She explains that her sister went to bed on Saturday night feeling well except for a slight sore throat, but in the middle of the night she awoke feeling chilled and oppressed on her lungs. Despite calling for a doctor, her sister passed away peacefully and without the knowledge or suspicion of her attendants. Abigail describes her sister's life as a continued series of useful services, extending beyond just providing food and clothing to also shaping the minds, manners, and morals of the many youths under her care. She mentions that her sister's husband trusted her completely and considered her his pride, glory, and crown. Abigail expresses deep sorrow and loss over her sister's passing.


This natural language result is a summary of the letter itself. Unlike a traditional vector search, we have a natural language response to a question. Unlike a traditional LLM prompt, we have domain-specific and up-to-date knowledge that can be used to generate a better response. Hopefully, this gives you a sense of the value of RAG systems. But we have only scratched the surface of what we can do with RAG. In the next notebook, we will learn how to create our own RAG systems.
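To support the validation step discussed earlier, it can help to list which chunks sat behind a generated answer. The sketch below assumes a response shaped like the ones printed above, with an `objects` list whose items expose a `properties` dictionary containing `chapter_title` and `chunk_index`. The `list_sources` helper is a hypothetical name, and the stand-in object only mimics that shape so the example runs without a server.

```python
from types import SimpleNamespace


def list_sources(response):
    # Each retrieved object carries the properties stored with its chunk.
    return [
        (obj.properties["chapter_title"], obj.properties["chunk_index"])
        for obj in response.objects
    ]


# Stand-in with the same shape as the response above, for illustration only.
fake_response = SimpleNamespace(
    objects=[
        SimpleNamespace(
            properties={
                "chapter_title": "From Abigail Smith Adams to Louisa Catherine Johnson Adams, 14 April 1815",
                "chunk_index": 0,
                "chunk": "Quincy April 14th 1815",
            }
        )
    ]
)

for title, index in list_sources(fake_response):
    print(f"{title} (chunk {index})")
```

Printing these pairs next to `response.generated` gives the reader a direct path from the generated summary back to the source letter.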

# Solution to the Exercise

Because our dataset is a small sample, this solution allows us to find the specific letter. If we were working with the entire dataset, we would probably have multiple results here. Depending on the number of results, we may want to modify or refine this approach to eliminate false positives.

In [36]:
for i, item in enumerate(data):
    if "Adams, Abigail Smith" in item["authors"]:
        if "sister" in item["content"].lower():
            print(f"Document Number: {i}")
            print(item["content"])

Document Number: 10
Quincy April 14th 1815
My dear Daughter
I address you, altho I know not where to find you, which is, and has been a source of much anxiety to me, four months have elapsed since the signature of the Treaty of Peace; when mr Adams wrote from Ghent, that in ten day’s, he should go to Paris, and from thence, send on to St petersburgh, to request you to join him there, and if he should, (as was expected,) be sent to England, that your Sons might be Sent to join their parents there. Altho Since the 27th of december, no letter has reached us from him. We have in compliance with his request, prepared the Children to meet you.
I need not Say, how painfull there seperation from me will be. Age, infirmities, and many recent afflictions, which I have met, in the Death of many near and dear Friends, and relatives, have broken me down, and give me little reason, to boast myself of tomorrow, as I know not, what a day may bring forth, So that when they go from me it is with the pai
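
If the keyword search above returned too many letters on a larger dataset, one way to refine it would be to require both a whole-word match for "sister" and a word related to death. The sketch below is an assumption-laden illustration: `find_candidates` is a hypothetical helper, and the list of death-related terms is a guess that would need tuning against the real letters.

```python
import re

# Whole-word patterns, so that "sister" does not match inside longer words.
SISTER = re.compile(r"\bsisters?\b", re.IGNORECASE)
# Hypothetical list of death-related terms; adjust it for your corpus.
DEATH_TERMS = re.compile(r"\b(death|died|deceased|funeral)\b", re.IGNORECASE)


def find_candidates(records, author="Adams, Abigail Smith"):
    hits = []
    for i, item in enumerate(records):
        if author not in item["authors"]:
            continue
        content = item["content"]
        # Require both topics to appear somewhere in the same letter.
        if SISTER.search(content) and DEATH_TERMS.search(content):
            hits.append(i)
    return hits
```

Calling `find_candidates(data)` returns the index of each matching document, which can then be inspected with `data[i]["content"]`.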