# Your First RAG Application

In this notebook, we'll walk you through each of the components that are involved in a simple RAG application.

We won't be leveraging any fancy tools, just the OpenAI Python SDK, Numpy, and some classic Python.

> NOTE: This was done with Python 3.11.4.

> NOTE: There might be [compatibility issues](https://github.com/wandb/wandb/issues/7683) if you're on an NVIDIA driver newer than 552.44. As an interim solution, you can roll back your drivers to 552.44.

## Table of Contents:

- Task 1: Imports and Utilities
- Task 2: Documents
- Task 3: Embeddings and Vectors
- Task 4: Prompts
- Task 5: Retrieval Augmented Generation
  - 🚧 Activity #1: Augment RAG

Let's look at a rather complicated-looking visual representation of a basic RAG application.

<img src="https://i.imgur.com/vD8b016.png" />

## Task 1: Imports and Utilities

We're just doing some imports and enabling `async` to work within the Jupyter environment here, nothing too crazy!

In [1]:
!pip install -qU numpy matplotlib plotly pandas scipy scikit-learn openai python-dotenv


In [2]:
from aimakerspace.text_utils import TextFileLoader, CharacterTextSplitter
from aimakerspace.vectordatabase import VectorDatabase
import asyncio

In [3]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Documents

We'll be concerning ourselves with this part of the flow in the following section:

<img src="https://i.imgur.com/jTm9gjk.png" />

### Loading Source Documents

So, first things first, we need some documents to work with.

While we could work directly with the `.txt` files (or whatever file-types you wanted to extend this to) we can instead do some batch processing of those documents at the beginning in order to store them in a more machine compatible format.

In this case, we're going to parse our text file into a single document in memory.

Let's look at the relevant bits of the `TextFileLoader` class:

```python
def load_file(self):
        with open(self.path, "r", encoding=self.encoding) as f:
            self.documents.append(f.read())
```

We're simply loading the document using the built-in `open` function, and storing that output in our `self.documents` list.


In [4]:
text_loader = TextFileLoader("data/PMarcaBlogs.txt")
documents = text_loader.load_documents()
len(documents)

1

In [5]:
print(documents[0][:100])


The Pmarca Blog Archives
(select posts from 2007-2009)
Marc Andreessen
copyright: Andreessen Horow


### Splitting Text Into Chunks

As we can see, there is one massive document.

We'll want to chunk the document into smaller parts so it's easier to pass the most relevant snippets to the LLM.

There is no fixed way to split/chunk documents - and you'll need to rely on some intuition as well as knowing your data *very* well in order to build the most robust system.

For this toy example, we'll just split blindly on length.

>There's an opportunity to clear up some terminology here. For this course, we will stick to the following:
>
>- "source documents" : The `.txt`, `.pdf`, `.html`, ..., files that make up the files and information we start with in its raw format
>- "document(s)" : single (or more) text object(s)
>- "corpus" : the combination of all of our documents

As you can imagine (though it's not specifically true in this toy example), the idea of splitting documents is to break them into manageably sized chunks that retain the most relevant local context.
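To make "splitting blindly on length" concrete, here is a minimal sketch of a fixed-size character splitter with overlap. The function name `split_text` and its default values are illustrative assumptions, not the `aimakerspace` implementation; the parameter names simply mirror the `chunk_size` and `chunk_overlap` arguments that `CharacterTextSplitter` accepts later in this notebook.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last one.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

The overlap means the last character of one chunk is repeated at the start of the next, which helps a little when a relevant sentence straddles a chunk boundary.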

In [6]:
text_splitter = CharacterTextSplitter()
split_documents = text_splitter.split_texts(documents)
len(split_documents)

373

Let's take a look at some of the documents we've managed to split.

In [7]:
split_documents[0:1]

['\ufeff\nThe Pmarca Blog Archives\n(select posts from 2007-2009)\nMarc Andreessen\ncopyright: Andreessen Horowitz\ncover design: Jessica Hagy\nproduced using: Pressbooks\nContents\nTHE PMARCA GUIDE TO STARTUPS\nPart 1: Why not to do a startup 2\nPart 2: When the VCs say "no" 10\nPart 3: "But I don\'t know any VCs!" 18\nPart 4: The only thing that matters 25\nPart 5: The Moby Dick theory of big companies 33\nPart 6: How much funding is too little? Too much? 41\nPart 7: Why a startup\'s initial business plan doesn\'t\nmatter that much\n49\nTHE PMARCA GUIDE TO HIRING\nPart 8: Hiring, managing, promoting, and Dring\nexecutives\n54\nPart 9: How to hire a professional CEO 68\nHow to hire the best people you\'ve ever worked\nwith\n69\nTHE PMARCA GUIDE TO BIG COMPANIES\nPart 1: Turnaround! 82\nPart 2: Retaining great people 86\nTHE PMARCA GUIDE TO CAREER, PRODUCTIVITY,\nAND SOME OTHER THINGS\nIntroduction 97\nPart 1: Opportunity 99\nPart 2: Skills and education 107\nPart 3: Where to go and wh

## Task 3: Embeddings and Vectors

Next, we have to convert our corpus into a "machine readable" format as we explored in the Embedding Primer notebook.

Today, we're going to talk about the actual process of creating, and then storing, these embeddings, and how we can leverage that to intelligently add context to our queries.

### OpenAI API Key

In order to access OpenAI's APIs, we'll need to provide our OpenAI API Key!

You can work through the folder "OpenAI API Key Setup" for more information on this process if you don't already have an API Key!

In [8]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Vector Database

Let's set up our vector database to hold all our documents and their embeddings!

While this is all baked into 1 call - we can look at some of the code that powers this process to get a better understanding:

Let's look at our `VectorDatabase().__init__()`:

```python
def __init__(self, embedding_model: EmbeddingModel = None):
        self.vectors = defaultdict(np.array)
        self.embedding_model = embedding_model or EmbeddingModel()
```

As you can see - our vectors are merely stored as a dictionary of `np.array` objects.

Secondly, our `VectorDatabase()` has a default `EmbeddingModel()` which is a wrapper for OpenAI's `text-embedding-3-small` model.

> **Quick Info About `text-embedding-3-small`**:
> - It has a context window of **8191** tokens
> - It returns vectors with dimension **1536**

#### ❓Question #1:

The default embedding dimension of `text-embedding-3-small` is 1536, as noted above. 

1. Is there any way to modify this dimension?
2. What technique does OpenAI use to achieve this?

> NOTE: Check out this [API documentation](https://platform.openai.com/docs/api-reference/embeddings/create) for the answer to question #1, and [this documentation](https://platform.openai.com/docs/guides/embeddings/use-cases) for an answer to question #2!

## **!!!!! Answer #1   !!!!**

The default embedding dimension of `text-embedding-3-small` is 1536, as noted above. 

1. _Is there any way to modify this dimension?_
    
    (a) Yes, by passing the `dimensions` parameter in the call to the OpenAI embeddings API.
    
    (b) Alternatively, the default embedding (1536 dimensions for `text-embedding-3-small`, 3072 for `text-embedding-3-large`) can be truncated to a lower dimension after retrieval. In that case, be sure to re-normalize the truncated vector to unit length.

2. _What technique does OpenAI use to achieve this?_
    
    Matryoshka Representation Learning (MRL), from a 2022 paper, which trains the model to encode information at different levels of granularity, so that a prefix of the embedding is still a useful embedding.
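Option (b) in the answer above, truncating an embedding and re-normalizing it, can be sketched as follows. The helper name `shorten_embedding` is hypothetical, and the three-element vector is a stand-in for a real 1536-dimensional embedding; plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def shorten_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values, then rescale to unit length (illustrative helper)."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in truncated]


# Stand-in unit vector: (3, 4, 12) / 13 has length 1.
full = [3 / 13, 4 / 13, 12 / 13]
short = shorten_embedding(full, 2)
print(short)  # approximately [0.6, 0.8]
```

Without the re-normalization step the shortened vector would no longer have unit length, which matters for any downstream code that assumes normalized embeddings.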

We can call the `async_get_embeddings` method of our `EmbeddingModel()` on a list of `str` and receive a list of embeddings (each one a list of `float`) back!

```python
async def async_get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
        return await aget_embeddings(
            list_of_text=list_of_text, engine=self.embeddings_model_name
        )
```

We cast those to `np.array` when we build our `VectorDatabase()`:

```python
async def abuild_from_list(self, list_of_text: List[str]) -> "VectorDatabase":
        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
        for text, embedding in zip(list_of_text, embeddings):
            self.insert(text, np.array(embedding))
        return self
```

And that's all we need to do!

In [9]:
vector_db = VectorDatabase()
vector_db = asyncio.run(vector_db.abuild_from_list(split_documents))

#### ❓Question #2:

What are the benefits of using an `async` approach to collecting our embeddings?

> NOTE: Determining the core difference between `async` and `sync` will be useful! If you get stuck - ask ChatGPT!

## **!!!! Answer #2 !!!!**

_What are the benefits of using an `async` approach to collecting our embeddings?_

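An `async` approach lets the embedding requests run concurrently: while one request is waiting on the network, others can already be in flight, so the total time is closer to that of the slowest request than to the sum of all of them. A `sync` approach would send the requests one after another and wait for each to finish, which is much slower for many documents.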
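The difference is easiest to see with simulated latency. In this sketch, `fake_embed` is a hypothetical stand-in for a single embeddings request (it only sleeps and returns a dummy vector); no real API is called.

```python
import asyncio
import time


async def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for one embeddings API request."""
    await asyncio.sleep(0.2)  # simulated network latency
    return [float(len(text))]


async def embed_all(texts: list[str]) -> list[list[float]]:
    # gather() schedules all requests at once and preserves input order.
    return await asyncio.gather(*(fake_embed(t) for t in texts))


start = time.perf_counter()
vectors = asyncio.run(embed_all(["a", "bb", "ccc"]))
elapsed = time.perf_counter() - start
print(vectors)  # [[1.0], [2.0], [3.0]]
print(f"{elapsed:.2f}s for three simulated requests")
```

Run one after another, the three simulated requests would take roughly 0.6 seconds; run concurrently, they take roughly 0.2 seconds. Inside Jupyter, `asyncio.run` only works here because `nest_asyncio.apply()` was called earlier.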


So, to review what we've done so far in natural language:

1. We load source documents
2. We split those source documents into smaller chunks (documents)
3. We send each of those documents to the `text-embedding-3-small` OpenAI API endpoint
4. We store each of the text representations with the vector representations as keys/values in a dictionary

### Semantic Similarity

The next step is to be able to query our `VectorDatabase()` with a `str` and have it return the vectors and text that are most relevant from our corpus.

We're going to use the following process to achieve this in our toy example:

1. We need to embed our query with the same `EmbeddingModel()` as we used to construct our `VectorDatabase()`
2. We loop through every vector in our `VectorDatabase()` and use a distance measure to compare how related they are
3. We return a list of the top `k` closest vectors, with their text representations

There's some very heavy optimization that can be done at each of these steps - but let's just focus on the basic pattern in this notebook.

> We are using [cosine similarity](https://www.engati.com/glossary/cosine-similarity) as the similarity measure in this example - but there are many other similarity and distance measures you could use - like [these](https://flavien-vidal.medium.com/similarity-distances-for-natural-language-processing-16f63cd5ba55)

> We are using a rather inefficient way of calculating relative distance between the query vector and all other vectors - there are more advanced approaches that are much more efficient, like [ANN](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6)
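The brute-force search described above can be sketched in a few lines. This is an assumption-laden illustration, not the `VectorDatabase` code: the names `cosine_similarity`, `search`, and `toy_db` are made up here, the vectors are two-dimensional toys, and the query is passed in already embedded.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best (text, score) pairs."""
    scores = [(text, cosine_similarity(query_vector, vec)) for text, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


toy_db = {"cats": [1.0, 0.0], "kittens": [0.9, 0.1], "stocks": [0.0, 1.0]}
results = search([1.0, 0.0], toy_db, k=2)
print([text for text, _ in results])  # ['cats', 'kittens']
```

Every query touches every stored vector, so the cost grows linearly with the corpus size; that is the inefficiency the ANN approaches linked above are designed to avoid.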

In [10]:
vector_db.search_by_text("What is the Michael Eisner Memorial Weak Executive Problem?", k=3)

[('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\nproduct management executive. The CEO who used to be in\nsales who has a weak sales executive. The CEO who used to be\nin marketing who has a weak marketing executive.\nI call this the “Michael Eisner Memorial Weak Executive Problem” — aaer the CEO of Disney who had previously been a brilliant TV network executive. When he bought ABC at Disney, it\npromptly fell to fourth place. His response? “If I had an extra\ntwo days a week, I could turn around ABC myself.” Well, guess\nwhat, he didn’t have an extra two days a week.\nA CEO — or a startup founder — oaen has a hard time letting\ngo of the function that brought him to the party. The result: you\nhire someone weak into the executive role for that function so\nthat you can continue to b

## Task 4: Prompts

In the following section, we'll be looking at the role of prompts - and how they help us to guide our application in the right direction.

In this notebook, we're going to rely on the idea of "zero-shot in-context learning".

This is a lot of words to say: "We will ask it to perform our desired task in the prompt, and provide no examples."

### XYZRolePrompt

Before we do that, let's stop and think a bit about how OpenAI's chat models work.

We know they have roles - as is indicated in the following API [documentation](https://platform.openai.com/docs/api-reference/chat/create#chat/create-messages)

There are three roles, and they function as follows (taken directly from [OpenAI](https://platform.openai.com/docs/guides/gpt/chat-completions-api)):

- `{"role" : "system"}` : The system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation. However note that the system message is optional and the model’s behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."
- `{"role" : "user"}` : The user messages provide requests or comments for the assistant to respond to.
- `{"role" : "assistant"}` : Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior.

The main idea is this:

1. You start with a system message that outlines how the LLM should respond, what kind of behaviours you can expect from it, and more
2. Then, you can provide a few examples in the form of "user"/"assistant" pairs
3. Then, you prompt the model with the true "user" message.

In this example, we'll be forgoing the 2nd step for simplicity's sake.
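As a rough sketch of the three-step pattern above, the message list might be assembled like this. The helper `build_messages` is hypothetical and only builds the list of role dictionaries; it does not call the API.

```python
from typing import Optional


def build_messages(
    system: str,
    user_query: str,
    examples: Optional[list[tuple[str, str]]] = None,
) -> list[dict[str, str]]:
    """Assemble system message, optional few-shot pairs, and the final user message."""
    messages = [{"role": "system", "content": system}]
    # Step 2 (optional): each example is a (user text, assistant text) pair.
    for example_user, example_assistant in examples or []:
        messages.append({"role": "user", "content": example_user})
        messages.append({"role": "assistant", "content": example_assistant})
    messages.append({"role": "user", "content": user_query})
    return messages


zero_shot = build_messages("You are a helpful assistant.", "What is RAG?")
print([m["role"] for m in zero_shot])  # ['system', 'user']
```

Skipping the examples, as this notebook does, leaves just the system message and the real user message.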

#### Utility Functions

You'll notice that we're using some utility functions from the `aimakerspace` module - let's take a peek at these and see what they're doing!

##### XYZRolePrompt

Here we have our `system`, `user`, and `assistant` role prompts.

Let's take a peek at what they look like:

```python
class BasePrompt:
    def __init__(self, prompt):
        """
        Initializes the BasePrompt object with a prompt template.

        :param prompt: A string that can contain placeholders within curly braces
        """
        self.prompt = prompt
        self._pattern = re.compile(r"\{([^}]+)\}")

    def format_prompt(self, **kwargs):
        """
        Formats the prompt string using the keyword arguments provided.

        :param kwargs: The values to substitute into the prompt string
        :return: The formatted prompt string
        """
        matches = self._pattern.findall(self.prompt)
        return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})

    def get_input_variables(self):
        """
        Gets the list of input variable names from the prompt string.

        :return: List of input variable names
        """
        return self._pattern.findall(self.prompt)
```
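To see what the placeholder handling does, here is a standalone sketch that reuses the same regular expression and formatting logic as `format_prompt` above. The module-level names `PLACEHOLDER` and `format_prompt`, and the `{tone}` placeholder, are illustrative additions.

```python
import re

# Same placeholder pattern as in BasePrompt above.
PLACEHOLDER = re.compile(r"\{([^}]+)\}")


def format_prompt(template: str, **kwargs) -> str:
    """Fill each {placeholder}; names without a supplied value become empty strings."""
    names = PLACEHOLDER.findall(template)
    return template.format(**{name: kwargs.get(name, "") for name in names})


template = "You are an expert in {expertise}, you always answer in a {tone} way."
print(PLACEHOLDER.findall(template))  # ['expertise', 'tone']
print(format_prompt(template, expertise="Python"))
# You are an expert in Python, you always answer in a  way.
```

Note the double space in the second output: a missing value is silently replaced with an empty string rather than raising a `KeyError`, which is convenient but can hide a forgotten argument.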

Then we have our `RolePrompt` which laser focuses us on the role pattern found in most API endpoints for LLMs.

```python
class RolePrompt(BasePrompt):
    def __init__(self, prompt, role: str):
        """
        Initializes the RolePrompt object with a prompt template and a role.

        :param prompt: A string that can contain placeholders within curly braces
        :param role: The role for the message ('system', 'user', or 'assistant')
        """
        super().__init__(prompt)
        self.role = role

    def create_message(self, **kwargs):
        """
        Creates a message dictionary with a role and a formatted message.

        :param kwargs: The values to substitute into the prompt string
        :return: Dictionary containing the role and the formatted message
        """
        return {"role": self.role, "content": self.format_prompt(**kwargs)}
```

We'll look at how the `SystemRolePrompt` is constructed to get a better idea of how that extension works:

```python
class SystemRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "system")
```

That pattern is repeated for our `UserRolePrompt` and our `AssistantRolePrompt` as well.

##### ChatOpenAI

Next we have our model, which is wrapped in an interface analogous to that of libraries like LangChain and LlamaIndex.

Let's take a peek at how that is constructed:

```python
class ChatOpenAI:
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        if self.openai_api_key is None:
            raise ValueError("OPENAI_API_KEY is not set")

    def run(self, messages, text_only: bool = True):
        if not isinstance(messages, list):
            raise ValueError("messages must be a list")

        openai.api_key = self.openai_api_key
        response = openai.ChatCompletion.create(
            model=self.model_name, messages=messages
        )

        if text_only:
            return response.choices[0].message.content

        return response
```

#### ❓ Question #3:

When calling the OpenAI API - are there any ways we can achieve more reproducible outputs?

> NOTE: Check out [this section](https://platform.openai.com/docs/guides/text-generation/) of the OpenAI documentation for the answer!

## !!!! Answer #3 !!!!

_When calling the OpenAI API - are there any ways we can achieve more reproducible outputs?_

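Yes. Setting the `temperature` parameter to a low value (e.g., 0) makes sampling far less random, and passing a fixed `seed` asks the API to sample deterministically on a best-effort basis.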
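As a small sketch of what such a request could look like, the helper below only assembles the keyword arguments; it makes no network call. `reproducible_request` is a hypothetical name, and passing the result to the SDK (for example as `client.chat.completions.create(**request)` with the v1 OpenAI Python client) is an assumption about how you would wire it up, not something the `ChatOpenAI` wrapper above does.

```python
def reproducible_request(model: str, messages: list[dict], seed: int = 42) -> dict:
    """Build chat-completion arguments aimed at more repeatable outputs (illustrative helper)."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,  # minimize sampling randomness
        "seed": seed,      # best-effort deterministic sampling
    }


request = reproducible_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "What is the best way to write a loop?"}],
)
print(request["temperature"], request["seed"])  # 0 42
```

Even with both settings, identical outputs are not guaranteed across model or backend updates, so this reduces variation rather than eliminating it.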


### Creating and Prompting OpenAI's `gpt-4o-mini`!

Let's tie all these together and use it to prompt `gpt-4o-mini`!

In [11]:
from aimakerspace.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
    AssistantRolePrompt,
)

from aimakerspace.openai_utils.chatmodel import ChatOpenAI

chat_openai = ChatOpenAI()
user_prompt_template = "{content}"
user_role_prompt = UserRolePrompt(user_prompt_template)
system_prompt_template = (
    "You are an expert in {expertise}, you always answer in a kind way."
)
system_role_prompt = SystemRolePrompt(system_prompt_template)

messages = [
    system_role_prompt.create_message(expertise="Python"),
    user_role_prompt.create_message(
        content="What is the best way to write a loop?"
    ),
]

response = chat_openai.run(messages)

In [12]:
print(response)

In Python, the best way to write a loop often depends on your specific use case, but here are some general guidelines using the two most common types of loops: `for` loops and `while` loops.

### For Loop
A `for` loop is particularly useful when you want to iterate over a sequence (like a list, tuple, string, or any iterable). Here’s an example of iterating over a list:

```python
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print(fruit)
```

### While Loop
A `while` loop is useful when you want to repeat an action until a condition changes. Here's an example:

```python
count = 0
while count < 5:
    print(count)
    count += 1
```

### Best Practices
1. **Use comprehensions when possible**: For simple iterations, list comprehensions can make your code cleaner and more Pythonic.

   ```python
   squares = [x**2 for x in range(10)]
   print(squares)
   ```

2. **Avoid infinite loops**: When using `while` loops, ensure the condition will eventually become `False`, or 

## Task 5: Retrieval Augmented Generation

Now we can create a RAG prompt - which will help our system behave in a way that makes sense!

There is much you could do here, many tweaks and improvements to be made!

In [13]:
RAG_PROMPT_TEMPLATE = """ \
Use the provided context to answer the user's query.

You may not answer the user's query unless there is specific context in the following text.

If you do not know the answer, or cannot answer, please respond with "I don't know".
"""

rag_prompt = SystemRolePrompt(RAG_PROMPT_TEMPLATE)

USER_PROMPT_TEMPLATE = """ \
Context:
{context}

User Query:
{user_query}
"""


user_prompt = UserRolePrompt(USER_PROMPT_TEMPLATE)

class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever

    def run_pipeline(self, user_query: str) -> dict:
        context_list = self.vector_db_retriever.search_by_text(user_query, k=4)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        formatted_system_prompt = rag_prompt.create_message()

        formatted_user_prompt = user_prompt.create_message(user_query=user_query, context=context_prompt)

        return {"response" : self.llm.run([formatted_system_prompt, formatted_user_prompt]), "context" : context_list}

#### ❓ Question #4:

What prompting strategies could you use to make the LLM have a more thoughtful, detailed response?

What is that strategy called?

> NOTE: You can look through the Week 1 Day 1 "Prompting OpenAI Like A Developer" material for an answer to this question!

## !!!! Answer #4 !!!!

_What prompting strategies could you use to make the LLM have a more thoughtful, detailed response?_

_What is that strategy called?_

Give the model clear, specific guidance in the prompt; in particular, ask it to reason step by step before giving its answer. This technique is called Chain-of-Thought prompting.



In [14]:
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai
)

In [15]:
retrieval_augmented_qa_pipeline.run_pipeline("What is the 'Michael Eisner Memorial Weak Executive Problem'?")

{'response': "The 'Michael Eisner Memorial Weak Executive Problem' refers to the tendency of a CEO or startup founder to hire an executive who is weak in the same area in which the CEO has expertise. This often happens because the CEO has difficulty letting go of the function that originally contributed to their success, leading to a situation where they can continue to dominate that area. The term is named after Michael Eisner, the former CEO of Disney, who struggled with the performance of ABC after its acquisition, yet believed he could turn it around if he had more time. The problem highlights the risks of hiring weak executives to maintain personal control, rather than focusing on strong leadership in those areas.",
 'context': [('ordingly.\nSeventh, when hiring the executive to run your former specialty, be\ncareful you don’t hire someone weak on purpose.\nThis sounds silly, but you wouldn’t believe how oaen it happens.\nThe CEO who used to be a product manager who has a weak\npr

### 🏗️ Activity #1:

Enhance your RAG application in some way! 

Suggestions are: 

- Allow it to work with PDF files
- Implement a new distance metric
- Add metadata support to the vector database

While these are suggestions, you should feel free to make whatever augmentations you desire! 

> NOTE: These additions might require you to work within the `aimakerspace` library - that's expected!

## **Summary of my work for Activity #1**

### I have implemented two types of RAG pipelines below.
Key features:
1.  Both RAG pipelines load information from PDF files.
2.  In the first pipeline, I use a single PDF document, an academic paper in finance, and pose a few high-level questions about the paper.
3.  In the second pipeline, I load two PDF documents on the same subject from different points in time. Specifically, I load Warren Buffett's letters to shareholders from 2022 and 2023. This requires me to also encode some metadata to help identify which year the information refers to. I implement this in a very simple way: by adding a phrase with the relevant year at the start of each chunk. I am eager to see how this type of metadata can be handled in a more efficient way.
4.  Clearly, the second pipeline can be generalized to any number of documents on the same topic over time, a very realistic scenario in many applications.


Key steps implemented below:
1.  Load each PDF document and convert it to text using the `pymupdf` module.
2.  Split each text document into chunks.
3.  For the multi-document pipeline, add metadata about the document to the start of each chunk string. This is a short piece of text stating that the chunk is from {year}.
4.  Merge the chunked strings for each document into a single list of chunks.
5.  Build the vector database and proceed as usual.
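Steps 3 and 4 for the multi-document pipeline can be sketched as follows. The helper name `tag_chunks`, the exact wording of the prefix, and the placeholder chunk strings are all assumptions made for illustration; the real chunks would come from the text splitter.

```python
def tag_chunks(chunks: list[str], year: int) -> list[str]:
    """Prepend a short year marker to every chunk (illustrative helper, hypothetical wording)."""
    prefix = f"This excerpt is from the {year} letter to shareholders.\n"
    return [prefix + chunk for chunk in chunks]


# Placeholder chunks standing in for the output of the text splitter.
chunks_by_year = {
    2022: ["first chunk of the 2022 letter", "second chunk of the 2022 letter"],
    2023: ["first chunk of the 2023 letter"],
}

all_tagged_chunks: list[str] = []
for year, chunks in chunks_by_year.items():
    all_tagged_chunks.extend(tag_chunks(chunks, year))

print(len(all_tagged_chunks))  # 3
print(all_tagged_chunks[2].splitlines()[0])
# This excerpt is from the 2023 letter to shareholders.
```

Because the year is part of the chunk text, it is embedded along with the content and is also visible to the LLM in the retrieved context. A structured metadata field on the vector database would be a cleaner alternative.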

In [16]:
### YOUR CODE HERE

### Install necessary packages

In [17]:
!pip install pymupdf
# !pip install pypdf


### Imports

In [18]:
# from myutils.pypdf_to_text import TextFromPyPdf
from myutils.pymupdf_to_text import TextFromPyMuPdf

In [19]:
from IPython.display import display, Markdown

def pretty_print(message: str) -> None:
    display(Markdown(message))

### RAG on a Single PDF Document

#### Load pdf document, convert it to text, chunk the text and build vector database

In [20]:
list_of_pdfs = ['./data/LSV_value_jf_pub_1994.pdf']
chunk_size = 1000
chunk_overlap = 200

In [21]:
all_chunks = []
textfrompdf = TextFromPyMuPdf(list_of_pdf_docs=list_of_pdfs)
list_of_documents = textfrompdf.process_all_pdfs()[0]
text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
split_documents = text_splitter.split_texts(list_of_documents)
print(f'completed chunking: final collection has {len(split_documents)} chunks')
all_chunks.extend(split_documents)
print(f'completed chunking for all docs: final corpus has {len(all_chunks)} chunks ')

completed chunking: final collection has 143 chunks
completed chunking for all docs: final corpus has 143 chunks 


In [22]:
my_vector_db = VectorDatabase()
my_vector_db = asyncio.run(my_vector_db.abuild_from_list(all_chunks))

#### Results from sample query to vector db

In [23]:
my_vector_db.search_by_text("What is value investing?", k=10)

[(' \nsimply equating a good investment with a well-run company irrespective of \nprice. Regardless of the reason, some investors tend to get overly excited \nabout stocks that have done very well in the past and buy them up, so that \nthese "glamour" stocks become overpriced. Similarly, they overreact to stocks \nthat have done very badly, oversell them, and these out-of-favor "value" \nstocks become underpriced. Contrarian investors bet against such naive \ninvestors. Because contrarian strategies invest disproportionately in stocks \nthat are underpriced and underinvest in stocks that are overpriced, they \noutperform the market (see De Bondt and Thaler (1985) and Haugen (1994)). \nAn alternative explanation of why value strategies have produced superior \nreturns, argued most forcefully by Fama and French (1992), is that they are \nfundamentally riskier. That is, investors in value stocks, such as high book- \nto-market stocks, tend to bear higher fundamental risk of some sort, and

#### System Role Prompt

In [24]:
RAG_PROMPT_TEMPLATE_TAIL = """\
{doc_specific_prompt}

Use ONLY the provided context to answer the user's query.

You may not answer the user's query unless there is specific context in the following text.

If you do not know the answer, or cannot answer, please respond with "I don't know".
"""

In [25]:
lsv_doc_specific_prompt = """
You are an expert in investment management and in the drivers of different stock market investing strategies.
You will be provided with a collection of papers on this topic.
The questions will relate to the content of these papers.

Take your time and think through the question step-by-step and provide your response to the question.

"""

In [26]:
LSV_RAG_PROMPT_TEMPLATE = RAG_PROMPT_TEMPLATE_TAIL.format(doc_specific_prompt=lsv_doc_specific_prompt)
lsv_rag_prompt = SystemRolePrompt(LSV_RAG_PROMPT_TEMPLATE)


#### User Role Prompt

In [27]:
USER_PROMPT_TEMPLATE = """ \
Context:
{context}

User Query:
{user_query}
"""

lsv_user_prompt = UserRolePrompt(USER_PROMPT_TEMPLATE)

In [28]:
class NewRetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever

    def run_pipeline(self, user_query: str) -> dict:
        context_list = self.vector_db_retriever.search_by_text(user_query, k=10)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        formatted_system_prompt = lsv_rag_prompt.create_message()

        formatted_user_prompt = lsv_user_prompt.create_message(user_query=user_query, context=context_prompt)

        return {"response" : self.llm.run([formatted_system_prompt, formatted_user_prompt]), "context" : context_list}

In [29]:
my_retrieval_augmented_qa_pipeline = NewRetrievalAugmentedQAPipeline(
    vector_db_retriever=my_vector_db,
    llm=chat_openai
)

In [30]:
my_retrieval_augmented_qa_pipeline.run_pipeline("Why are value strategies successful?")

{'response': "Value strategies are considered successful for a couple of key reasons as highlighted in the context provided:\n\n1. **Contrarian Investment**: Value strategies often include a contrarian approach that exploits the mistakes of naive investors who tend to overreact to past performance. Investors might overly favor 'glamour' stocks—those that have performed well recently—leading to their overpricing. Conversely, 'value' stocks—those that have underperformed—may become underpriced. Contrarian investors leverage this by investing in these underpriced stocks, thereby potentially leading to superior returns.\n\n2. **Behavioral Mispricing**: Some research suggests that the higher expected returns from value strategies are due to behavioral errors in the market. For instance, investors may extrapolate past performance too far into the future, leading to mispricing opportunities that value strategies can capitalize on.\n\n3. **Less Evidence of Greater Risk**: The context also indi

In [31]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("Why are value strategies successful?")

In [32]:
pretty_print(gpt_response['response'])

Value strategies are considered successful for a couple of primary reasons, based on the context provided:

1. **Contrarian Behavior**: Value strategies may exploit the mistakes made by naive investors who tend to overreact to stock performance. Investors often become overly excited about stocks that have performed well (glamour stocks), thereby driving their prices up and making them overpriced. Conversely, they may overreact to poorly performing stocks, causing these value stocks to become underpriced. Contrarian investors take advantage of these discrepancies by investing disproportionately in underpriced stocks and underinvesting in overpriced stocks, resulting in superior market performance.

2. **Risk Considerations**: An alternative explanation suggested by researchers such as Fama and French is that value strategies involve greater fundamental risk. Investors in value stocks, like those with high book-to-market ratios, may bear higher risk, and thus their higher average returns could simply be a compensation for this risk. However, the analysis presented indicates little evidence supporting the notion that value strategies are fundamentally riskier than glamour strategies.

In summary, the success of value strategies can be attributed to their contrarian nature, capitalizing on investor behavioral biases, and potentially higher returns as compensation for increased risk, although the evidence for the latter is inconclusive.

In [33]:
my_retrieval_augmented_qa_pipeline.run_pipeline("What characteristics can an investor use to identify value stocks?")

{'response': 'An investor can identify value stocks using several characteristics, including:\n\n1. **Low Prices Relative to Earnings**: This can be evaluated through low earnings-to-price ratios.\n  \n2. **Low Prices Relative to Dividends**: Investors can look for stocks with low dividend yields relative to their price.\n\n3. **Low Prices Relative to Historical Prices**: Price comparisons to historical values could indicate undervaluation.\n\n4. **High Book-to-Market Ratio**: Stocks with a high ratio of book value to market value are often considered value stocks.\n\n5. **High Cash Flow-to-Price Ratio**: A high ratio of cash flow relative to price can also predict higher returns.\n\n6. **Out-of-Favor Status**: Value stocks are typically those that have been overlooked or underpriced in the market compared to glamour stocks.\n\nThese characteristics are supported by empirical evidence indicating that value strategies have produced superior returns over time.',
 'context': [(' \nsimply 

In [34]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("What characteristics can an investor use to identify value stocks?")

In [35]:
pretty_print(gpt_response['response'])

Investors can identify value stocks using several characteristics, including:

1. **Low Prices Relative to Earnings**: Investing in stocks with low price-to-earnings (P/E) ratios.

2. **Large Book-to-Market Ratios**: Stocks with high book values relative to their market prices tend to be categorized as value stocks.

3. **High Cash Flow-to-Price Ratios**: A high ratio of cash flow to price can predict higher returns.

4. **Low Prices Relative to Other Measures of Value**: This includes low prices relative to dividends, historical prices, or other value metrics.

5. **Out-of-Favor Status**: Identifying stocks that have underperformed in the market and are currently out of favor can also signal value opportunities.

These characteristics help investors differentiate value stocks from glamour stocks, which are often priced based on the expectation of future growth that may not materialize.

In [36]:
my_retrieval_augmented_qa_pipeline.run_pipeline("What is the cumulative return to value over the past several decades?")

{'response': 'The cumulative return to value stocks outperformed glamour stocks by 90 percent over Years 1 through 5 after formation.',
 'context': [(' Table I, we present the returns for Years 1 through 5 after \nthe formation (R1 through R5), the average annual 5-year return (AR), the \ncumulative 5-year return (CR5), and the size-adjusted average annual 5-year \nreturn (SAAR). The numbers presented are -the averages across all formation \nperiods in the sample. The results confirm and extend the results established \nby Rosenberg, Reid, and Lanstein (1984), Chan, Hamao, and Lakonishok \n(1991), and Fama and French (1992). On average over the postformation \nyears, the low B/M (glamour) stocks have an average annual return of 9.3 \npercent and the high B/M (value) stocks have an average annual return of \n19.8 percent, for a difference of 10.5 percent per year. If portfolios are held \nwith the limited rebalancing described above, then cumulatively value stocks \noutperform glamour s

In [37]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("What is the cumulative return to value over the past several decades?")

In [38]:
pretty_print(gpt_response['response'])

The cumulative return to value stocks over the Years 1 through 5 is 90 percent, as value stocks outperform glamour stocks during this period.

### RAG on Multiple PDF Documents

#### Load PDF documents, convert each to text, chunk the text, prepend metadata as a string, and build the vector database

In [39]:
chunk_prefix = 'Warren Buffett letter in year {year} '

In [40]:
list_of_years = [
    '2022',
    '2023'
]

list_of_pdfs = ['data/' + x + 'ltr.pdf' for x in list_of_years]

chunk_size = 1000
chunk_overlap = 200

In [41]:
all_chunks = []

# Reuse the paths built in list_of_pdfs instead of rebuilding them here.
for year, pdffile in zip(list_of_years, list_of_pdfs):
    textfrompdf = TextFromPyMuPdf(list_of_pdf_docs=[pdffile])
    list_of_documents = textfrompdf.process_all_pdfs()[0]
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    split_documents = text_splitter.split_texts(list_of_documents)
    split_documents = [chunk_prefix.format(year=year) + ' ' + x for x in split_documents]
    print(f'completed chunking for year {year} doc: it has {len(split_documents)} chunks')
    all_chunks.extend(split_documents)
print(f'completed chunking for all docs across all years: final corpus has {len(all_chunks)} chunks ')

completed chunking for year 2022 doc: it has 39 chunks
completed chunking for year 2023 doc: it has 64 chunks
completed chunking for all docs across all years: final corpus has 103 chunks 


In [42]:
my_vector_db = VectorDatabase()
my_vector_db = asyncio.run(my_vector_db.abuild_from_list(all_chunks))

#### Results from sample query to vector db

In [43]:
my_vector_db.search_by_text("What does Warren Buffett say about Charlie Munger?", k=10)

[('Warren Buffett letter in year 2023  \n Charlie Munger – The Architect of Berkshire Hathaway  \nCharlie Munger died on November 28, just 33 days before his 100th birthday. \nThough born and raised in Omaha, he spent 80% of his life domiciled \nelsewhere. Consequently, it was not until 1959 when he was 35 that I first met him. \nIn 1962, he decided that he should take up money management. \nThree years later he told me – correctly! – that I had made a dumb decision in \nbuying control of Berkshire. But, he assured me, since I had already made the move, \nhe would tell me how to correct my mistake. \nIn what I next relate, bear in mind that Charlie and his family did not have a \ndime invested in the small investing partnership that I was then managing and whose \nmoney I had used for the Berkshire purchase. Moreover, neither of us expected that \nCharlie would ever own a share of Berkshire stock. \nNevertheless, Charlie, in 1965, promptly advised me: “Warren, forget about \never buyin

#### System Role Prompt

In [44]:
RAG_PROMPT_TEMPLATE_TAIL = """\
{doc_specific_prompt}

Use ONLY the provided context to answer the user's query.

You may not answer the user's query unless there is specific context in the following text.

If you do not know the answer, or cannot answer, please respond with "I don't know".
"""

In [45]:
bh_doc_specific_prompt = """
You are an expert in understanding letters written by CEOs of major financial corporations to their shareholders.
These letters cover a lot of ground about the high-level activities of the company, the economic and other challenges faced by the company during that year and potential opportunities for growth in the future, etc.

You will be provided with a set of annual letters written by Warren Buffett to 
the shareholders of Berkshire Hathaway.  The questions will relate to his commentary in these letters.

Take your time and think through the question step-by-step and provide your response to the question.

"""

In [46]:
BH_RAG_PROMPT_TEMPLATE = RAG_PROMPT_TEMPLATE_TAIL.format(doc_specific_prompt=bh_doc_specific_prompt)
bh_rag_prompt = SystemRolePrompt(BH_RAG_PROMPT_TEMPLATE)


#### User Role Prompt

In [47]:
USER_PROMPT_TEMPLATE = """\
Context:
{context}

User Query:
{user_query}
"""

bh_user_prompt = UserRolePrompt(USER_PROMPT_TEMPLATE)

In [48]:
class NewRetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever

    def run_pipeline(self, user_query: str) -> dict:
        context_list = self.vector_db_retriever.search_by_text(user_query, k=10)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        formatted_system_prompt = bh_rag_prompt.create_message()

        formatted_user_prompt = bh_user_prompt.create_message(user_query=user_query, context=context_prompt)

        return {"response" : self.llm.run([formatted_system_prompt, formatted_user_prompt]), "context" : context_list}
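
`run_pipeline` above concatenates all `k=10` retrieved chunks into the prompt without checking its length. A minimal sketch of a guard follows, under stated assumptions: `approx_token_count` and `fit_context_to_budget` are hypothetical helpers, not part of `aimakerspace`, and the 4-characters-per-token ratio is only a rough rule of thumb for English text. A real tokenizer would give exact counts.

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def fit_context_to_budget(chunks: list[str], max_context_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks, in order, until the budget would be exceeded."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = approx_token_count(chunk)
        if used + cost > max_context_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


# Toy stand-ins for retrieved chunk texts, already sorted by similarity.
ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]  # roughly 100 tokens each
selected = fit_context_to_budget(ranked_chunks, max_context_tokens=250)
print(f"kept {len(selected)} of {len(ranked_chunks)} chunks")
```

Such a guard could be applied to the chunk texts before `context_prompt` is built. The budget value depends on the model in use and is not specified here.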

In [49]:
my_retrieval_augmented_qa_pipeline = NewRetrievalAugmentedQAPipeline(
    vector_db_retriever=my_vector_db,
    llm=chat_openai
)

In [50]:
my_retrieval_augmented_qa_pipeline.run_pipeline("What does Warren Buffett say about Charlie Munger?")

{'response': 'Warren Buffett shares a heartfelt tribute to Charlie Munger, highlighting his significant impact on Berkshire Hathaway and their partnership. He describes Munger as the "architect" of the present Berkshire, while he himself acted as the "general contractor" to carry out Munger\'s vision. Buffett recounts their first meeting in 1959 and how Munger advised him against buying another company like Berkshire, encouraging him instead to focus on acquiring wonderful businesses at fair prices.\n\nBuffett acknowledges Munger\'s sharp intellect and clarity in reasoning, noting that he often delivers insights more succinctly and artfully than Buffett himself can. Throughout their long partnership, Munger has helped Buffett stay grounded and return to rational thinking when Buffett\'s old habits resurged. Buffett appreciates Munger\'s straightforwardness and the way he challenges his thinking, describing their relationship as one without conflicts and characterized by mutual respect 

In [51]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("What does Warren Buffett say about Charlie Munger?")

In [52]:
pretty_print(gpt_response['response'])

Warren Buffett describes Charlie Munger as a significant influence and partner in running Berkshire Hathaway. He refers to Charlie as the "architect" of the present Berkshire, highlighting that while he (Buffett) acted as the "general contractor," Munger provided essential guidance and ideas that fundamentally shaped their investment philosophy. Buffett recounts his early interactions with Munger and how Munger correctly advised him on investment strategies, steering him away from buying other companies like Berkshire and instead focusing on acquiring wonderful businesses at fair prices. He appreciates Munger's clarity in reasoning and his ability to challenge Buffett's thinking, stating that Munger has always brought him back to sanity when his old habits emerge. Buffett also emphasizes the deep mutual respect and understanding in their partnership, describing their relationship as one that has never featured conflict or shouting matches.

In [53]:
my_retrieval_augmented_qa_pipeline.run_pipeline("Did Berkshire Hathaway face any major challenges in 2023?")

{'response': 'Yes, Berkshire Hathaway faced challenges in 2023, particularly that most of its non-insurance businesses experienced lower earnings. However, this decline was expected to be cushioned by decent results from its two largest non-insurance businesses, BNSF and Berkshire Hathaway Energy, which together accounted for more than 30% of operating earnings in 2022.',
 'context': [('Warren Buffett letter in year 2023  . . . . . . .\n14,937 \n14,549 \nOperating earnings . . . . . . . . . . . . . . . . . . . . . . . . . .\n$37,350 \n$30,853 \nAt Berkshire’s annual gathering on May 6, 2023, I presented the first quarter’s results \nwhich had been released early that morning. I followed with a short summary of the outlook for \nthe full year: (1) most of our non-insurance businesses faced lower earnings in 2023; (2) that \ndecline would be cushioned by decent results at our two largest non-insurance businesses, BNSF \nand Berkshire Hathaway Energy (“BHE”) which, combined, had accounted

In [54]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("Did Berkshire Hathaway face any major challenges in 2023?")

In [55]:
pretty_print(gpt_response['response'])

Yes, Berkshire Hathaway faced challenges in 2023, as indicated by Warren Buffett's commentary. Most of the non-insurance businesses experienced lower earnings, which would be somewhat cushioned by the performance of its two largest non-insurance businesses, BNSF and Berkshire Hathaway Energy. Additionally, there were indications of a prolonged period of global economic weakness and fear, which could impact performance.

In [56]:
my_retrieval_augmented_qa_pipeline.run_pipeline("Compare Berkshire Hathaway's report for 2022 and 2023?")

{'response': "In comparing Berkshire Hathaway's report for 2022 and 2023, several key financial figures and operational insights can be observed:\n\n1. **Operating Earnings**:\n   - In 2022, Berkshire reported operating earnings of $30,853 million.\n   - In 2023, operating earnings increased to $37,350 million, indicating a significant growth year-over-year.\n\n2. **Individual Business Performance**:\n   - **Insurance Underwriting**:\n     - 2022: ($30) million (loss)\n     - 2023: $5,428 million (profit), showcasing a substantial turnaround in the insurance underwriting segment.\n   - **Insurance Investment Income**:\n     - 2022: $6,484 million\n     - 2023: $9,567 million, indicating an increase in investment income.\n   - **Railroad**:\n     - 2022: $5,946 million\n     - 2023: $5,087 million, showing a decrease in earnings from this segment.\n   - **Utilities and Energy**:\n     - 2022: $3,904 million\n     - 2023: $2,331 million, reflecting a decline in earnings as well.\n   - **

In [57]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("Compare Berkshire Hathaway's report for 2022 and 2023?")

In [58]:
pretty_print(gpt_response['response'])

Based on the provided context, here's a comparison of Berkshire Hathaway's report for 2022 and 2023:

1. **Operating Earnings**:
   - **2022**: $30,853 million
   - **2023**: $37,350 million
   - There was an increase in operating earnings from 2022 to 2023.

2. **Insurance-Underwriting**:
   - **2022**: $(30) million
   - **2023**: $5,428 million
   - The insurance-underwriting component showed a significant turnaround from a loss in 2022 to a substantial profit in 2023.

3. **Insurance-Investment Income**:
   - **2022**: $6,484 million
   - **2023**: $9,567 million
   - There was a noticeable increase in insurance-investment income in 2023.

4. **Railroad**:
   - **2022**: $5,946 million
   - **2023**: $5,087 million
   - Operating earnings from the railroad segment decreased from 2022 to 2023.

5. **Utilities and Energy**:
   - **2022**: $3,904 million
   - **2023**: $2,331 million
   - There was a sharp decline in operating earnings from utilities and energy.

6. **Other Businesses and Miscellaneous Items**:
   - **2022**: $14,549 million
   - **2023**: $14,937 million
   - This segment saw a slight increase in earnings.

7. **Overall Outlook**:
   - In 2023, Buffett noted that most of the non-insurance businesses faced lower earnings compared to 2022. This overall decline was offset by the strong performance of Berkshire's largest non-insurance businesses, namely BNSF and Berkshire Hathaway Energy.

In summary, Berkshire Hathaway reported an overall increase in operating earnings in 2023 compared to 2022, driven largely by a remarkable recovery in insurance-underwriting and increased insurance-investment income, despite declines in railroad and utility earnings.

In [59]:
my_retrieval_augmented_qa_pipeline.run_pipeline("Compare Warren Buffett's overall sentiment about Berkshire Hathaway in 2022 and 2023")

{'response': 'In Warren Buffett\'s letters to shareholders for 2022 and 2023, his overall sentiment about Berkshire Hathaway reflects a shift in outlook regarding the company\'s performance and future prospects.\n\nIn 2022, Buffett expressed a sense of achievement regarding the value created through acquisitions and the growth of Berkshire\'s insurance float. He highlighted the successful acquisition of Alleghany Corporation and noted the positive impact of disciplined underwriting. Despite acknowledging market challenges, he conveyed a sense of confidence in Berkshire\'s resilience and its ability to outperform the average American corporation while operating with lower risk.\n\nIn contrast, the 2023 letter presents a more cautious tone. Buffett indicated that most non-insurance businesses at Berkshire were likely to face lower earnings for the year. While he pointed out that the decline would be cushioned by solid results from BNSF and Berkshire Hathaway Energy, he acknowledged the c

In [60]:
gpt_response = \
    my_retrieval_augmented_qa_pipeline.run_pipeline("Compare Warren Buffett's overall sentiment about Berkshire Hathaway in 2022 and 2023")

In [61]:
pretty_print(gpt_response['response'])

In 2023, Warren Buffett expresses a generally optimistic yet cautious sentiment about Berkshire Hathaway. He highlights that while most non-insurance businesses faced lower earnings, significant earnings from BNSF and Berkshire Hathaway Energy will cushion the decline. He acknowledges the growth in investment income due to a favorable position in U.S. Treasury bills and anticipates good performance from the insurance segment. Buffett also notes the positive long-term prospects of the businesses owned by Berkshire, enhanced by a stable group of long-time managers, suggesting a cautious optimism about the company's adaptability and resilience.

In contrast, the 2022 letter reflects a more defensive stance, discussing challenges such as the complexity of managing capital deployment and acknowledging that achieved performance may not be eye-popping due to increased competition and fewer attractive investment opportunities. While he speaks of significant gains in float and strategic purchases like Alleghany Corporation, the tone is more subdued, recognizing the difficulties faced by the company and the broader market environment.

Overall, the sentiment in 2023 is one of tempered optimism with a focus on the company's strengths and strategic positioning, while the sentiment in 2022 appears more cautious and reflective on challenges within the business landscape.

## Summary of Key Learnings

1.  Results are sensitive to the chunk size, the chunk overlap, and the number of retrieved chunks stuffed into the context.
2.  Chain-of-Thought (CoT) prompting led to more frequent full responses.  This was especially true for the multi-document pipeline when the question required reasoning across multiple documents.  Without CoT, the LLM frequently replied "I don't know".
3.  The PDF loaders I tried, pypdf and pymupdf, were unable to load tables from the PDF documents.  This may well be more an issue with the PDF file format than with the loaders, but it was still an important learning for me.
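
To make the first point concrete, here is a minimal sketch of a sliding-window character splitter. `split_with_overlap` is a hypothetical helper written for illustration; it is not necessarily how `CharacterTextSplitter` works internally, and the sample text is a placeholder rather than one of the letters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample_text = "x" * 10_000  # placeholder for a document's extracted text

counts = {}
for chunk_size, chunk_overlap in [(500, 100), (1000, 200), (2000, 400)]:
    chunks = split_with_overlap(sample_text, chunk_size, chunk_overlap)
    counts[(chunk_size, chunk_overlap)] = len(chunks)
    print(f"chunk_size={chunk_size}, chunk_overlap={chunk_overlap}: {len(chunks)} chunks")
```

With a fixed `k`, smaller chunks put less total text into the context, while larger chunks put more text but fewer distinct passages. This is one reason the answers shift when these parameters change.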

## Summary of Key Questions and Areas to Investigate Further

1.  Are there "automated" ways to verify the accuracy of the responses from a RAG system?  I found it tedious to read through the top-K retrieved chunks stuffed into the prompt as context to 'quasi-confirm' that the compiled response was based solely on the context.
2.  How do large-scale RAG systems incorporate metadata (e.g., the year corresponding to a document) into the information saved and retrieved?  I used a simple, "kludgy" approach of prepending a short piece of text to each chunk, but this (a) does not seem efficient and (b) may not be precise enough to identify a chunk as belonging to a particular year.
3.  Is there a way to automatically detect, or get a warning, when the length of the prompt approaches or exceeds the maximum context length of the LLM?  Is the user required to pre-process each query and context for length, or does the LLM raise a warning when the context length is exceeded?