# Intro

An LLM in isolation knows only what it has been trained on, which doesn't include your personal data, proprietary data, or public articles that were written after the LLM was trained.

However, with certain techniques, you can use an LLM to hold a conversation about your own documents.

This notebook illustrates these techniques using the [LangChain](https://www.langchain.com/) framework.

| NOTE: |
| :---- |
| The LangChain API changes frequently, sometimes in backward-incompatible ways. This guide has been updated to conform to version `0.1.11`. |

## LangChain

[LangChain](https://www.langchain.com/) is an OSS development framework for building LLM applications using Python or TypeScript.

It's focused on composition and modularity. It also provides support for common use cases so that you can apply certain techniques in a very easy way.

LangChain's main components include:

| LangChain Component | Capabilities |
| :------------------ | :----------- |
| **Prompts** | Templates and implementations. |
| **Models**  | Integration with LLMs, chat models, etc. |
| **Indexes** | Document loaders, text splitters, integration with vector stores, and retrievers. |
| **Chains** | Chaining LLMs for complex applications. |
| **Agents** | Reasoning engine to determine which actions should be taken. |

## Setting up shop

The Jupyter notebook uses Poetry for dependency management.

The `pyproject.toml` has been set up for you, so you just need to run:

```bash
poetry install
```

This will create a virtual environment and install all the required dependencies to run the notebook cells.

| NOTE: |
| :---- |
| The project has been configured with `package-mode = false` which tells Poetry that `pyproject.toml` is used only for dependency management. |

Once the installation finishes, select the virtual environment that Poetry created as your kernel in VSCode.

![VSCode Poetry: Kernel configuration](pics/vscode-poetry-configuration.png)

## Retrieval augmented generation

In the technique known as Retrieval Augmented Generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when you want to ask questions about specific documents (e.g., your PDFs, a set of videos, etc.).

The following diagram depicts the different steps involved in this technique.

![RAG stages (High Level)](pics/rag_stages_hl.png)

Let's start by setting up the environment so that we can interact with the LLM.

We will be consuming OpenAI capabilities through Azure OpenAI service.

For starters, we will need to identify:

+ `openai.api_type` &mdash; selects the API type: either the native OpenAI one (`"openai"`) or the Azure OpenAI one (`"azure"`).

+ `openai.api_version` &mdash; the API spec to use when interacting with the service. The allowed values are documented in the [Azure OpenAI reference](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference).

+ `openai.api_base` &mdash; the endpoint at which the service accepts requests. You can find this value in Azure Portal (see below).

+ `openai.api_key` &mdash; the corresponding API key for the endpoint. You can find this value in Azure Portal (see below).

![Azure Portal: endpoints](pics/azure-openai-endpoint.png)

In order not to expose the keys in the source code and to foster a more flexible configuration, we will be using [`python-dotenv`](https://pypi.org/project/python-dotenv/).

This module will let you use a `.env` file in which you will be able to configure all the needed pieces of data to connect to Azure OpenAI:

```INI
AZURE_OPENAI_ENDPOINT = https://....openai.azure.com/
AZURE_OPENAI_API_KEY = ...
```

Along with some other application-specific configuration parameters:

```INI
OPENAI_API_VERSION = 2023-05-15
AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME = ada
...
```

In [1]:
import os

from dotenv import load_dotenv

load_dotenv()


True
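Before moving on, it can help to confirm that the settings were actually picked up. The following is a minimal sketch rather than part of the original notebook: the helper name `missing_settings` is hypothetical, and treating all four variables from the `.env` examples above as required is an assumption made for illustration.

```python
import os

# Variable names taken from the `.env` examples above; treating all of
# them as required is an assumption made for this sketch.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME",
)


def missing_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# After load_dotenv(), an empty list means every setting is available.
print(missing_settings())
```

Failing early on a missing setting is usually easier to debug than an authentication error raised later by the service client.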

## Step 1: Document Loading

![RAG: step 1](pics/rag_step_1.png)

The first step of the RAG approach is the loading of documents.

The sources can be of many different types according to:

+ How we access them:
    + Local or remote file systems
    + Web sites
    + Databases
    + Video sites (such as YouTube)
    + ...

+ The data types of the obtained information:
    + PDF
    + HTML
    + JSON
    + Word, PowerPoint, etc.
    + ...


When performing this step with LangChain, a list of `Document` objects will be returned, with each object having both `page_content` and `metadata` attributes.

### Loading a PDF

The following snippet illustrates how to load a specific PDF containing a transcript from a Machine Learning training course:

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf")
pages = loader.load()

The PDF will be loaded as a list of `Document` objects, with each `Document` representing a page of the document:

In [3]:
# number of documents and number of pages will be the same
len(pages)

22

Each object will feature the extracted text in the `page_content` attribute and the contextual metadata information in the `metadata` attribute.

In [4]:
page = pages[0]
print(page.page_content[0:200])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 



In [5]:
page.metadata

{'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 0}

For more information on working with PDFs, see the [PDF](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) section of the LangChain documentation.

### Loading a YouTube video

The following snippet illustrates how to download a YouTube video, transcribe it, and then load it into a list of `Document` objects.

| NOTE: |
| :---- |
| This example does not work with AzureOpenAI. |


In [None]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_path = "data/youtube"

loader = GenericLoader(
    YoutubeAudioLoader([url], save_path),
    OpenAIWhisperParser()
)
docs = loader.load()

### Loading information from a URL

LangChain also provides loaders for websites.

In [41]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/semver/semver/blob/master/semver.md?plain=1")

docs = loader.load()

docs[0].page_content[0:100]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsemver/semver.md at mast'

## Step 2: Splitting

![RAG: Step 2](pics/rag_step_2.png)

In this second step, you need to split each `Document` object into smaller chunks in such a way that meaningful relationships are retained.

For example, suppose we have a document with the content:
> ...on this model. The Toyota Camry has a head-snapping 80 HP and an eight speed automatic transmission that will...

and it is split into the following chunks:

| Chunk # | Content |
| :------ | :------ |
| 1 | on this model. The Toyota Camry has a head-snapping |
| 2 | 80 HP and an eight speed automatic transmission that will |

If we're asked "What are the specifications on the Camry?", neither chunk holds the answer on its own: the model name is in the first chunk and the specifications are in the second.

The following snippet illustrates how a LangChain splitter is invoked:

```python
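from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",    # character sequence to split on
    chunk_size=4000,     # maximum size of each chunk
    chunk_overlap=200,   # overlap window shared by consecutive chunks
    length_function=len  # how the chunk size is measured (characters here)
)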
```

And the following picture illustrates those concepts:

![Chunks](pics/chunks.png)

There are a few key methods exposed on the splitter object:
+ `create_documents()` &mdash; create documents from a list of texts.
+ `split_documents()` &mdash; split a list of `Document` objects into a list of smaller `Document` objects.
+ `split_text()` &mdash; split a string into a list of smaller strings.

And there are different types of splitters, all defined in the `langchain.text_splitter` module:

| Splitter | Description |
| :------- | :---------- |
| `CharacterTextSplitter` | Split text by characters. |
| `MarkdownHeaderTextSplitter` | Split a markdown file based on its headers. |
| `TokenTextSplitter` | Split text looking at its tokens. |
| `SentenceTransformersTokenTextSplitter`| Split text to tokens using a sentence tokenizer model. |
| `RecursiveCharacterTextSplitter` | Split text by characters, recursively trying different separators until it finds one that works. |
| `Language` | Split programming languages text. |
| `NLTKTextSplitter` | Split text by looking at sentences using [NLTK](https://www.nltk.org/). |
| `SpacyTextSplitter` | Split text by looking at sentences using [spaCy](https://github.com/explosion/spaCy). |


Splitters take care of maintaining consistent metadata across chunks.

Let's develop our intuition about how the splitters work by using two of the most common text splitters: `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`.

In [2]:
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)

Then, we set a very small chunk size and chunk overlap to see how the splitters behave.

In [3]:
chunk_size = 26
chunk_overlap = 4

Now we instantiate the splitters:

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Let's now define some sample strings and see how the splitters behave:

In [45]:
text1 = "abcdefghijklmnopqrstuvwxyz"
r_splitter.split_text(text1)


['abcdefghijklmnopqrstuvwxyz']

Note that the string is not split, because the chunk size was set to 26, and the whole text fits in a single chunk.

If we try now with a slightly longer string:

In [46]:
text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Now two chunks are created:
+ The first chunk is 26 chars long and ends in "z".
+ The second chunk is 11 chars long. It begins with "wxyz" because we told the text splitter to use a chunk overlap of 4, and it ends with the last chars of the input text.
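The arithmetic behind those two chunks can be reproduced with a plain sliding window. This is only a sketch to build intuition: `sliding_chunks` is a hypothetical helper, and it is not how LangChain implements its splitters (it ignores separators entirely).

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

With a chunk size of 26 and an overlap of 4, each new window starts 22 characters after the previous one, which is why the second chunk begins at "w".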

Let's now use a longer string in which characters are separated by spaces:

In [47]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In this case we get three chunks:
+ The first chunk is 25 chars long. Note that the chunk ends in "m" (i.e., the last space is dropped).
+ The second chunk is also 25 chars long. It starts with "l m" because of the specified chunk overlap.
+ The third chunk is 7 characters long, and it also starts with the last two letters of the previous chunk ("w x").

For this first set of tests we've used the `RecursiveCharacterTextSplitter`, which is the recommended starting point for splitting text.

Alternatively, we can use the `CharacterTextSplitter`, which splits text based on a user defined character (separator). Let's see how it works with the same examples:

In [51]:
text1 = "abcdefghijklmnopqrstuvwxyz"
c_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

The character splitter behaves in the same way as the recursive character splitter because the input text fits in one chunk.

For the second test:

In [52]:
text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"
c_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyzabcdefg']

Note that despite having set the chunk size to 26, the character text splitter returns a single chunk of 33 characters.

It doesn't perform any splitting because, by default, the separator for the `CharacterTextSplitter` is `"\n\n"` (a blank line), which doesn't appear in the text.

If we test it against the third string:

In [5]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

Again, the string remains in one chunk because the splitter doesn't find the separator.

Let's change the definition of the `CharacterTextSplitter` so that it uses `" "` as the separator:

In [7]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=" "
)

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Now we get the same three chunks that the `RecursiveCharacterTextSplitter` produced.

| NOTE: |
| :---- |
| The `RecursiveCharacterTextSplitter` differs from the `CharacterTextSplitter` in that it works with a list of separators instead of a single one. By default that list is `["\n\n", "\n", " ", ""]`, which means it'll try to split by `"\n\n"`, then by `"\n"`, then by `" "`, and lastly character by character (empty string). |
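The fallback through the separator list can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size, trying separators in order."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that actually occurs in the text.
    for index, sep in enumerate(separators):
        if sep == "" or sep in text:
            remaining = separators[index + 1:]
            break
    else:
        sep, remaining = "", ()

    if sep == "":
        # Last resort: cut character by character into fixed-size pieces.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part is still too big: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("one two\n\nthree four", 10))
```

The design point is that coarse separators are preferred, so paragraphs stay whole whenever they fit, and finer separators are used only for the parts that are still too long.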

Let's now use more "real-world" examples starting with a long text paragraph:

In [9]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""


print(f"{len(some_text)=} characters")

len(some_text)=496 characters


Note that the text contains two long paragraphs separated by `"\n\n"`.

Let's configure the splitters to use `" "` as the separator for the character text splitter and `["\n\n", "\n", " ", ""]` for the recursive one.

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=" "
)

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Let's start by reviewing the result of splitting with the character text splitter:

In [13]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

You see that it creates two chunks, but with no meaningful separation between them.

By contrast, the recursive splitter does a much better job:

In [14]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

See that chunks are created by splitting the text into paragraphs.

Let's now use a smaller chunk size on the recursive splitter:

In [16]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

The result is better than the character splitter, but we see that some of the sentences have been split in half.

Let's also include `"."` as a separator:

In [17]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", ".", " ", ""]
)

r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space',
 '.and words are separated by space.']

It's better, but still not optimal: the `"."` characters end up at the beginning of the chunks.

Let's now use a more elaborate sentence separator based on a regular expression: a lookbehind that matches the position right after a period followed by a space. Note that we need to tell the `RecursiveCharacterTextSplitter` that the separators are regular expressions:

In [19]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],  # raw string avoids an invalid "\." escape
    is_separator_regex=True
)

r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']

Now we have much better chunks, with no sentences split in half, and with each chunk starting with a proper sentence.
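To see why the lookbehind keeps each period attached to its sentence, it can be tried in isolation with the standard `re` module. This is a small sketch that uses its own sample sentence, not the notebook's text:

```python
import re

# Zero-width lookbehind: matches the empty position right after ". ",
# so nothing is consumed when splitting.
SENTENCE_BOUNDARY = re.compile(r"(?<=\. )")

text = "Paragraphs form a document. Sentences end with a period. Words are separated by spaces."

# Splitting on the literal ". " consumes the period of each sentence...
naive = text.split(". ")

# ...while splitting on the lookbehind keeps it with the sentence.
sentences = [part.strip() for part in SENTENCE_BOUNDARY.split(text)]

print(naive)
print(sentences)
```

Splitting on a pattern that matches an empty string requires Python 3.7 or later.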

In summary:
+ `RecursiveCharacterTextSplitter` seems to do a much better job than the `CharacterTextSplitter`.
+ We might need to work on a good separator strategy for our text so that:
  + Paragraphs are not split in half.
  + Sentences are not split in half.
  + Chunks start and end in proper sentences, so that they are meaningful on their own.

### Using splitters with real-world documents

Now that we have some intuition about how splitters work, we can start using them on real-world documents.

Let's start by loading a PDF and using the `CharacterTextSplitter` with some basic configuration and a `"\n"` separator:

In [22]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf")
pages = loader.load()

In [23]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separator="\n",
    length_function=len
)

Because now we're dealing with the `list[Document]` returned by `load()`, we'll need to use `split_documents()` on the splitter:

In [24]:
docs = text_splitter.split_documents(pages)

`split_documents()` also returns a `list[Document]`. Each of the resulting `Document` objects will contain a small portion of one of the original pages returned by `load()`:

In [25]:
print(f"Number of pages in the PDF: {len(pages)=}")
print(f"Number of 'Document' objects after splitting: {len(docs)=}")

Number of pages in the PDF: len(pages)=22
Number of 'Document' objects after splitting: len(docs)=77


And we can inspect the contents of each resulting `Document` object:

In [26]:
print(f"First 100 characters of the first Document: {docs[0].page_content[0:100]}")
print(f"First 100 characters of the second Document: {docs[1].page_content[0:100]}")


First 100 characters of the first Document: MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machi
First 100 characters of the second Document: related to the machine learni ng and all aspects of machin e learning. Paul Baumstarck 
works in mac


And the metadata:

In [29]:
print(f"Metadata of the first Document: {docs[0].metadata}")
print(f"Metadata of the second Document: {docs[1].metadata}")
print(f"Metadata of the tenth Document: {docs[9].metadata}")


Metadata of the first Document: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 0}
Metadata of the second Document: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 0}
Metadata of the tenth Document: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 2}


See how we can link back to our original PDF by looking at the `page` property of the metadata.
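The reason every chunk can be traced back to its page is that the metadata of the source document is copied onto each chunk. The sketch below mimics that behavior with a simplified stand-in: `Doc` and `split_docs` are hypothetical names, not LangChain classes, and the fixed-size cut is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Simplified stand-in for a document with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, chunk_size):
    """Cut each document into fixed-size chunks, copying its metadata."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            # Each chunk gets its own copy, so later edits don't leak across chunks.
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


pages = [
    Doc("a" * 25, {"source": "example.pdf", "page": 0}),
    Doc("b" * 5, {"source": "example.pdf", "page": 1}),
]
chunks = split_docs(pages, chunk_size=10)
print([chunk.metadata["page"] for chunk in chunks])
```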

### Splitting by Tokens

You can split text and documents by tokens instead of by characters.

Splitting by tokens is useful because most of the LLMs have context windows whose size is designated by token counts, and therefore, these splitters will give us a better idea about how the LLMs will see those texts.

To get some intuition about how splitting by tokens works, let's initialize a `TokenTextSplitter` with `chunk_size = 1` and `chunk_overlap = 0`. That will ensure the text we pass will be split into individual tokens:

In [1]:
from langchain_text_splitters import TokenTextSplitter

token_text_splitter = TokenTextSplitter(
    chunk_size=1,
    chunk_overlap=0,
)

text1 = "foo bar bazzyfoo"
token_text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

See how those three words end up creating a list of six tokens.

Let's now try with a general sentence.

In [3]:
text2 = "The quick brown fox jumps over the lazy dog."
token_text_splitter.split_text(text2)

['The',
 ' quick',
 ' brown',
 ' fox',
 ' jumps',
 ' over',
 ' the',
 ' lazy',
 ' dog',
 '.']

In this second example, the `token_text_splitter` simply returns a list of the words, some of them prefixed by space.

We can apply it in a similar way to our PDF:

In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf")
pages = loader.load()

token_text_splitter = TokenTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
)

docs = token_text_splitter.split_documents(pages)

We can have a look at the information in each chunk, and have a look at its metadata too.

In [14]:
print(f"Contents of the first Document/Chunk: {docs[0].page_content}")
print(f"Contents of the second Document/Chunk: {docs[1].page_content}")

Contents of the first Document/Chunk: MachineLearning-Lecture01  

Contents of the second Document/Chunk: Instructor (Andrew Ng):  Okay. Good


In [12]:
print(f"Metadata of the first Document/Chunk: {docs[0].metadata}")
print(f"Metadata of the second Document/Chunk: {docs[1].metadata}")
print(f"Metadata of the 81st Document/Chunk: {docs[80].metadata}")

Metadata of the first Document/Chunk: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 0}
Metadata of the second Document/Chunk: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 0}
Metadata of the 81st Document/Chunk: {'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf', 'page': 1}


### Context-aware splitting

The metadata of each Document/Chunk is a key source of context: it tells you where a piece of text came from, which you can rely on to give better answers.

There are special splitters that enrich that metadata field for us.

For example, the `MarkdownHeaderTextSplitter` will split a markdown document by headers and populate the metadata field of the document accordingly to get additional context information.

Let's see it in action:

In [17]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\nHi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
)

md_header_splits = markdown_text_splitter.split_text(markdown_text)

Let's inspect the first document. We'll see that the metadata information of the split is enriched in a way that lets us link back to the original document, thus providing additional context for our searches:

In [18]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

The result can be understood as follows:
> The content from the 1st split comes from "# Title" &raquo; "## Chapter 1"

Similarly, for the second split:

In [19]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})

We see that the content comes from "# Title" &raquo; "## Chapter 1" &raquo; "### Section"

Finally, for the third split:

In [20]:
md_header_splits[2]

Document(page_content='Hi this is Molly', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'})

The text comes from "# Title" &raquo; "## Chapter 2"
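The bookkeeping behind those results can be approximated in plain Python: keep track of the currently active headers and attach a copy of them to every block of text. This is a simplified sketch, not the library's implementation, and `split_by_headers` is a hypothetical helper.

```python
def split_by_headers(markdown, headers):
    """Return (content, metadata) pairs, one per block of text under a header path."""
    names = dict(headers)  # e.g. {"#": "Header 1", "##": "Header 2"}
    active = {}            # headers that apply to the current position
    buffer = []            # text lines collected since the last header
    sections = []

    def flush():
        if buffer:
            sections.append(("\n".join(buffer), dict(active)))
            buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        prefix, _, title = line.partition(" ")
        if prefix in names and title:
            flush()
            # A new header closes every header at the same or a deeper level.
            for other_prefix, other_name in headers:
                if len(other_prefix) >= len(prefix):
                    active.pop(other_name, None)
            active[names[prefix]] = title.strip()
        else:
            buffer.append(line)
    flush()
    return sections


sample = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
header_levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

for content, metadata in split_by_headers(sample, header_levels):
    print(metadata, "->", repr(content))
```

Note how starting "Chapter 2" discards the "Section" header: a header only applies until another header of the same or a higher level appears.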

With this, we've done a quick walkthrough of how we can split a text document in semantically relevant chunks with appropriate metadata. This will lead us to proper storage of this information in a vector store.

## Step 3: Storage

![Step 3: Storage](pics/rag_step_3.png)

In this step we deal with storage, and we learn about *embeddings* and *vector stores*.

In the previous steps we've extracted data from our information sources with the document loaders, and created chunks with the splitters.

After splitting, we need to store that information in a convenient format for the subsequent steps.

*Embedding vectors* (or *embeddings* for short) are a way to represent the content/meaning of text data as numerical vectors. Similar content or meaning will have similar vectors in this numeric space.

By using embeddings, we will be able to compare pieces of text and find similarities:

![Embedding](pics/embedding.png)

+ An embedding vector captures the content/meaning of some text.
+ Texts with similar content will have similar vectors.

By way of embedding, we will be able to tell whether certain pieces of text are similar or not:

![Embedding similarity](pics/embedding-similarity.png)

Once embeddings have been created, we need a specialized database to store them. Because ultimately embeddings are vectors, that database is called a *vector store*.

A vector store will let us interrogate the embeddings to find the ones most similar to a given one. Each embedding vector will be associated with the original split from which the vector was created.

The storage process will be as follows:
1. Extract text from our data sources.
2. Split them in chunks/splits.
3. Create embeddings from those chunks.
4. Store the resulting embedding vectors on a vector store.

And for the bigger picture, if we consider the use case in which a user types a question and expects a response, the additional steps needed will be:
1. Query the vector store to find the pieces of information (splits) that most closely resemble what the user is asking for.
2. Send those splits to the LLM to craft a response to the user.
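The storage and query steps above can be sketched end to end with a toy in-memory store. Everything here is an assumption made for illustration: `toy_embed` counts words instead of calling a real embedding model, and `ToyVectorStore` is a hypothetical class rather than an actual vector database.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self._rows = []  # (embedding, original chunk) pairs

    def add(self, chunks):
        """Storage: embed every chunk and keep it next to its vector."""
        for chunk in chunks:
            self._rows.append((toy_embed(chunk), chunk))

    def similarity_search(self, query, k=3):
        """Query: embed the question and return the k closest chunks."""
        query_vector = toy_embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = ToyVectorStore()
store.add([
    "the course email is listed on the website",
    "linear regression fits a line to the data",
    "the weather is ugly outside",
])
print(store.similarity_search("which email can I use", k=1))
```

A real embedding model would also match chunks that share meaning but no words with the question, which word counts cannot do.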

The diagram below illustrates the overall process:

![Storage+Retrieval](pics/storage+retrieval.png)

Let's simulate the storage process with a real-world scenario by loading some information from a few PDFs.

Note that we deliberately load the same PDF twice to simulate the duplicated data that often shows up in real-world datasets.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loaders = [
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf"),
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

Then we proceed to do the splitting using the `RecursiveCharacterTextSplitter` with the default delimiters:

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

splits = text_splitter.split_documents(docs)

len(splits)

209

So we end up with 209 splits with their corresponding content and metadata.

### Embeddings

Before creating the embeddings for our real-world scenario, let's develop some intuition about what the embeddings really are and how we create them.

Everything starts by creating an `embeddings` object from the `AzureOpenAIEmbeddings` class:

| NOTE: |
| :---- |
| The corresponding Azure OpenAI keys need to be available as environment variables. |

In [4]:
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    deployment=os.getenv("AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME")
)

Let's now define a few simple sentences and inspect the corresponding embedding vectors:

In [3]:
sentence1 = "I like dogs"
sentence2 = "I like canines"
sentence3 = "The weather is ugly outside"

In [7]:
embedding1 = embeddings.embed_query(sentence1)
embedding2 = embeddings.embed_query(sentence2)
embedding3 = embeddings.embed_query(sentence3)

In [9]:
for embedding in [embedding1, embedding2, embedding3]:
    print(f"Embedding vector: {embedding}")

Embedding vector: [-0.020944543635958147, 0.002272979871182798, -0.02557025532608112, -0.02327580427813735, -0.03003646125804415, 0.021631651094632342, -0.011472259896331832, -0.0060980889402820265, 0.011944647205492936, -0.020269703558502175, 0.006834276634038958, 0.030944426282130927, -0.001014558853178482, -0.004570499161415018, 0.003105792284146437, 0.011447720477282411, 0.03720202377454068, -0.001618079410480712, 0.00639256383152028, -3.0938100040575346e-05, -0.0076686229353161725, 0.005782141475298907, 0.010079638319220537, -0.03545971291483799, -0.00819008908257612, 0.01130661765359499, 0.010006019130749676, -0.0027300296941920404, -0.02782789864243465, -0.018404694672229775, 0.037422881339953265, -0.0017208389844497749, -0.01807341018675609, -0.02148441458033581, 0.005137977301469254, -0.02721440850958613, 0.0059079072784957625, -0.004680927478460012, -0.001823598558418838, -0.01404891752297304, -0.012539732541223746, 0.022171522039010006, 0.00868088118885494, -0.02425738849069

With the embeddings in place, we can use mathematical concepts to study their similarity.

For example, we can recall from 2D algebra that the *dot product* of two vectors gives us a number that reflects how well aligned they are: for vectors of the same length, the greater their dot product, the more aligned the vectors are.

The same concept can be applied to vectors with more than two dimensions, and the NumPy package can help us with that:

In [10]:
import numpy as np

np.dot(embedding1, embedding2)

0.9664565515783483

In [11]:
np.dot(embedding1, embedding3)

0.7650815196622498

In [12]:
np.dot(embedding2, embedding3)

0.7573246926801385

Note that the dot product between the first and second vectors is greater than both the dot product between the first and third and the dot product between the second and third, which means the first two vectors are the most *similarly oriented*.

Note also that the degree of *dissimilarity* between the first and the third vector, and the second and the third vector is almost the same!
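One caveat is that the raw dot product also grows with the length of the vectors, not only with their alignment. Dividing by the lengths gives the *cosine similarity*, which depends on direction alone. The sketch below uses small made-up vectors and hypothetical helper names to show that, for unit-length vectors, both measures coincide:

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length, keeping its direction."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(u, v):
    """Dot product divided by both lengths: 1.0 means same direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice as long

print(dot(u, v))                        # affected by the vectors' lengths
print(cosine_similarity(u, v))          # direction only
print(dot(normalize(u), normalize(v)))  # same value once vectors are unit length
```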

### Vector stores

Embedding vectors and the chunk they represent are stored in vector stores.

[Chroma](https://github.com/chroma-core/chroma) is a popular vector store that is both lightweight and in-memory.

It can also persist the database content in regular files, which makes it very easy to get started with.

Because the latest versions of Chroma require a more modern version of sqlite3 than the one packaged with Python, it is necessary to run the following lines:

In [2]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

When using Chroma in LangChain, you should start by importing the corresponding package and configuring the path where the files will be stored:

In [3]:
from langchain_community.vectorstores import Chroma

persist_path = "./chroma_data/"

It is a good practice to remove any old db files that might be in the `chroma_data/` directory from previous runs:

In [7]:
!rm -rf $persist_path

Then, you can instantiate the Chroma DB client that represents the vector store:

In [8]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_path
)

Now, we can use the client `vectordb` to interrogate the database contents:

In [9]:
vectordb._collection.count()

209

In [10]:
assert vectordb._collection.count() == len(splits)

### Similarity Search

With the vector store set up, you can start interrogating the PDFs to get specific answers:

In [22]:
question = "is there an email I can use to ask for help?"

LangChain provides the `similarity_search()` method to do so. You just need to provide a value for the parameter `k` which tells LangChain the number of documents to retrieve:

In [23]:
docs = vectordb.similarity_search(question, k=3)

Now we can browse the results of invoking that method:

In [20]:
docs[0].page_content

"MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning. Paul Baumstarck \n

In [15]:
docs[1].page_content

"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to f

In [16]:
docs[2].page_content

"So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we  usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, a nd I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the te chnical content of this  class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on  that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in  the class to get to  know each other and \nhave whatever discussions you want to ha 

So the results look good. We can also explore the metadata of the results:

In [24]:
for i, doc in enumerate(docs):
    print(f"metadata[{i}]: {doc.metadata=}")

metadata[0]: doc.metadata={'page': 5, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf'}
metadata[1]: doc.metadata={'page': 5, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf'}
metadata[2]: doc.metadata={'page': 5, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf'}


| NOTE: |
| :---- |
| The page indices start from zero, so `"page": 5` actually refers to page 6 in the document. |

You can use the `persist()` method to save the state of the `vectordb` in the file system, so that you don't need to go through the documents again.

In [25]:
vectordb.persist()

Note that with similarity search alone, we can obtain pretty good results:

In [26]:
question = "How does the instructor introduce himself?"
docs = vectordb.similarity_search(question, k=3)
docs[0].page_content

"MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning. Paul Baumstarck \n

Because of this, we might wonder whether we need to invoke the LLM at all &mdash; isn't the similarity search sufficient?

The answer is no. We can outline a few drawbacks of using the similarity search alone.

The most evident issue is that we see repetition in the retrieved results:

In [27]:
question = "What did they say about matlab?"
docs = vectordb.similarity_search(question, k=3)

In [28]:
docs[0].page_content

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just

In [29]:
docs[1].page_content

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just

In [30]:
docs[2].page_content

'into his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learning class. I learned so much from it. There\'s this stuff that I learned in your \nclass, and I now use every day. And it\'s help ed me make lots of money, and here\'s a \npicture of my big house."  \nSo my friend was very excited. He said, "W ow. That\'s great. I\'m glad to hear this \nmachine learning stuff was actually useful. So what was it that you learned? Was it \nlogistic regression? Was it the PCA? Was it the data ne tworks? What was it that you \nlearned that was so helpful?" And the student said, "Oh, it was the MATLAB."  \nSo for those of you that don\'t know MATLAB yet, I hope you do learn it. It\'s not hard, \nand we\'ll actually have a short MATLAB tutori al in one of the discussion sections for \nthose of you that don\'t know it.  \nOkay. The very last piece of logistical th ing is the discussion s ections. So discussion \nsections will be taught by the TAs, and atte ndan

It seems that because we loaded the same document twice:

```python
from langchain_community.document_loaders import PyPDFLoader

loaders = [
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf"),
    PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf"),
]
```

the semantic search is returning the same chunk twice, which is not very helpful.

That is something that might (and will) happen in real-world scenarios, as you cannot always control the quality of ingested data.
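
One simple mitigation for exact duplicates is to filter the result list by content before using it. The following is a minimal sketch, not part of LangChain: `deduplicate` is a hypothetical helper, and `SimpleNamespace` objects with shortened sample text stand in for the retrieved `Document` objects:

```python
from types import SimpleNamespace


def deduplicate(docs):
    """Drop documents whose text was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Stand-ins for retrieved documents (shortened, illustrative text).
results = [
    SimpleNamespace(page_content="those homeworks will be done in either MATLAB or in Octave"),
    SimpleNamespace(page_content="those homeworks will be done in either MATLAB or in Octave"),
    SimpleNamespace(page_content="we'll actually have a short MATLAB tutorial"),
]

unique_results = deduplicate(results)
print(len(results), "->", len(unique_results))
```

This only catches chunks whose text is identical; near-duplicates need a similarity-aware approach.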

Another less evident issue is that restricting the search to a particular portion of the total information that has been loaded does not work:

In [33]:
question = "What did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)

The expectation is that all the results should come from the document `data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf`, but in reality the results do not take that restriction into account:

In [35]:
for i, doc in enumerate(docs):
    print(f"metadata[{i}]: {doc.metadata=}")

metadata[0]: doc.metadata={'page': 0, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
metadata[1]: doc.metadata={'page': 2, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf'}
metadata[2]: doc.metadata={'page': 14, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
metadata[3]: doc.metadata={'page': 13, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
metadata[4]: doc.metadata={'page': 17, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf'}
metadata[5]: doc.metadata={'page': 0, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf'}
metadata[6]: doc.metadata={'page': 6, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}


We see some responses come from the third lecture, but some others come from other documents.

This happens because the question includes a piece of structured information (*"bring results from the 3rd lecture"*) mixed with semantic information ("*bring results about regression*"), and the similarity search is simply doing a semantic lookup based on embeddings.

As a result, the structured information is not taken into account, because that part is not captured in the semantic embedding.

We can prove that by examining the results that have been retrieved from the second lecture.

The results are retrieved because they talk about regression in that lecture:

In [38]:
print(docs[1].page_content)

Instructor (Andrew Ng) :All right, so who thought driving could be that dramatic, right? 
Switch back to the chalkboard, please. I s hould say, this work was done about 15 years 
ago and autonomous driving has come a long way. So many of you will have heard of the 
DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, the winning 
team's drive a car across a desert by itself.  
So Alvin was, I think, absolutely amazing wo rk for its time, but autonomous driving has 
obviously come a long way since then. So what  you just saw was an example, again, of 
supervised learning, and in particular it was an  example of what they  call the regression 
problem, because the vehicle is trying to predict a continuous value variables of a 
continuous value steering directions , we call the regression problem.  
And what I want to do today is talk about our first supervised learning algorithm, and it 
will also be to a regression task. So for the running example that I'm going to use 
t

Because of these issues, the storage step and the corresponding similarity search are not enough and we need the retrieval step.

## Step 4: Retrieval

![Step 4: Retrieval](pics/rag_step_4.png)

One of the shortcomings of the similarity search is that sometimes we don't get the most relevant splits that have to do with the question from the user.

Retrieval is relevant at *query time* &mdash; you have received a query and you want to end up with the set of most relevant splits regarding the query.

In these sections we will discuss certain techniques that will help us get better results from our vector store.

These techniques can be broadly classified as follows:
+ Techniques to improve query results obtained from the vector store:
    + Maximum Marginal Relevance (MMR) to take into account diversity of results.
    + Using metadata in vector store queries

+ LLM-aided retrieval
    + SelfQuery
    + Compression

### Improving query results with MMR technique

Maximum Marginal Relevance (MMR) is a technique that lets you retrieve a set of diverse results from the vector database instead of only the most relevant ones.

That is, when using this technique you reduce the possibility that important information is missed when you query the vector store.

In this technique you:

1. Query the vector store to choose the most similar responses (using semantic similarity).
2. Amongst those responses, choose the most diverse ones.

The following diagram illustrates the idea behind this technique:

![MMR](pics/mmr_retrieval_technique.png)
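
The two steps above can be sketched in plain Python. This is a minimal illustration of the usual greedy MMR formulation, not LangChain's implementation; the function name `mmr_select`, the `lambda_mult` trade-off parameter, and the toy two-dimensional vectors are assumptions made for illustration:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mmr_select(query, candidates, k=2, fetch_k=3, lambda_mult=0.5):
    """Return the indices of k candidates, balancing relevance and diversity."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    remaining = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )[:fetch_k]

    # Step 2: greedily pick the candidate that is relevant to the query
    # but not redundant with what has already been selected.
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.2],   # 0: close to the query
    [1.0, 0.2],   # 1: exact duplicate of 0
    [1.0, -0.3],  # 2: slightly less similar, but different from 0
]

print(mmr_select(query, candidates, k=2, fetch_k=3))
print(mmr_select(query, candidates, k=2, fetch_k=3, lambda_mult=1.0))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection degenerates into a plain similarity ranking, so the duplicate is returned again.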

### LLM-aided retrieval

The following sections will introduce the LLM-aided retrieval techniques so that you can see where those techniques help.

Note also that when using the LLM-aided techniques you will be making more calls to the language model, and therefore the cost of the solution will increase. However, these techniques can make a real difference when it comes to increasing the quality of the overall solution.

#### SelfQuery

With LLM-aided retrieval you run the question through an LLM to be able to split the original question into two separate pieces:
+ a filter
+ a search term

Then, you'd be able to pass the filter to the vector store's metadata engine, and the search term to the vector store's similarity search engine.

The following picture illustrates the approach:

![LLM-aided retrieval (SelfQuery)](pics/llm-aided-retrieval-selfquery.png)
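
To make the split concrete, here is a toy sketch in which a regular expression stands in for the LLM. This is only an illustration of the input and output of the step, under the assumption that the filter targets a `source` metadata field; `split_question` and `LECTURE_SOURCES` are hypothetical names, and a real implementation would delegate the parsing to the language model:

```python
import re

# Hypothetical mapping from ordinal words to `source` metadata values.
LECTURE_SOURCES = {
    "first": "data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf",
    "second": "data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf",
    "third": "data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf",
}


def split_question(question):
    """Split a question into (search_term, metadata_filter).

    A regular expression stands in for the LLM here.
    """
    match = re.search(
        r"\s+in the (first|second|third) lecture", question, re.IGNORECASE
    )
    if match is None:
        # No structured part found: the whole question is the search term.
        return question, None
    metadata_filter = {"source": LECTURE_SOURCES[match.group(1).lower()]}
    search_term = question[:match.start()] + question[match.end():]
    return search_term.strip(), metadata_filter


search_term, metadata_filter = split_question(
    "What did they say about regression in the third lecture?"
)
print(search_term)
print(metadata_filter)
```

The search term would then go to the similarity search, and the filter to the metadata engine.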

#### Compression

Another LLM-aided retrieval technique is *Compression*.

You will have noticed that when you query the database you will end up with a set of `Document` objects that contain the most relevant information, but that maybe only two or three sentences will be useful to answer the query.

The idea behind this technique is to pass the relevant splits retrieved using the basic similarity search through an LLM, so that only the most relevant segments are considered in the final language model call.

The following diagram illustrates this approach:

![LLM-aided retrieval (Compression)](pics/llm-aided-retrieval-compression.png)
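
The effect of compression can be illustrated with a toy sketch in which simple keyword matching stands in for the LLM. The helper name `compress` and the sample chunk are hypothetical; an actual compressor would ask the language model which parts of each chunk are relevant to the query:

```python
import re


def compress(text, query_terms):
    """Keep only the sentences of `text` that mention any query term."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [term.lower() for term in query_terms]
    kept = [s for s in sentences if any(term in s.lower() for term in wanted)]
    return " ".join(kept)


# Illustrative chunk: only two of its four sentences are about MATLAB.
chunk = (
    "The class has a home page. "
    "Homeworks will be done in MATLAB or Octave. "
    "Discussion sections will be taught by the TAs. "
    "We will have a short MATLAB tutorial."
)

print(compress(chunk, ["matlab"]))
```

The compressed chunk is shorter than the original, so less irrelevant text reaches the final language model call.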

### Retrieval techniques in action

In this section we will explore how to implement the retrieval techniques using LangChain.

Because we have already gone through:
+ Step 1: Document loading
+ Step 2: Splitting
+ Step 3: Storage

we can directly instantiate the *embedding function* that we will use to create the embedding vector of the user's query, and load the contents of the database from the persist directory.

Note that we first need to swap the default SQLite3 module for a more modern one:

In [2]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [3]:
import os

from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import Chroma

persist_path = "./chroma_data/"

embeddings = AzureOpenAIEmbeddings(
    deployment=os.getenv("AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME")
)

vectordb = Chroma(persist_directory=persist_path, embedding_function=embeddings)

We can then make sure that all of our splits have been successfully retrieved:

In [4]:
print(vectordb._collection.count())
assert vectordb._collection.count() == 209

209


#### Using MMR to get more diverse results

To develop our intuition about how the MMR technique works, let's load a smaller piece of information about mushrooms into a database:

In [5]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

Then, we create a small db with the splits from that information:

In [6]:
smalldb = Chroma.from_texts(texts, embedding=embeddings)

Let's confirm that it has been correctly loaded in memory:

In [7]:
assert smalldb._collection.count() == 3

Let's start querying the database and getting some splits using the basic similarity search:

In [9]:
question = "Tell me about all-white mushrooms with large fruiting bodies"
docs = smalldb.similarity_search(question, k=2)

for i, doc in enumerate(docs):
    print(f"contents doc[{i}]: {doc.page_content=}")


contents doc[0]: doc.page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'
contents doc[1]: doc.page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).'


While the results coming out of the loaded knowledge base are relevant (they are related to Amanita phalloides, which is a white mushroom with a large fruiting body), they omit the fact that Amanita phalloides is a poisonous mushroom.

Let's now apply the MMR retrieval technique, which is supposed to improve the quality of the results by including more diverse information, and see if that fact is retrieved:

![MMR](pics/mmr_retrieval_technique.png)

In [10]:
docs = smalldb.max_marginal_relevance_search(
    question,
    fetch_k=3,
    k=2
)

for i, doc in enumerate(docs):
    print(f"contents doc[{i}]: {doc.page_content=}")

contents doc[0]: doc.page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'
contents doc[1]: doc.page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.'


We see how applying the MMR technique retrieves better results, reducing duplication and including more diverse details from our knowledge base.

Let's confirm it by applying the same approach to our larger example. We will first use the basic similarity search, confirm that some of the information is duplicated, and then apply MMR and see that we get better results: 

In [13]:
question = "What did they say about matlab?"

docs = vectordb.similarity_search(question, k=3)

for i, doc in enumerate(docs):
    print(f"contents doc[{i}]: {doc.page_content=}")

contents doc[0]: doc.page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of 

We see that the similarity search query returns duplicated information in the first two documents.

Our expectation is that, once we apply the MMR technique, we will get more diverse, deduplicated results:

| NOTE: |
| :---- |
| We're using the default `fetch_k` value. |

In [14]:
docs = vectordb.max_marginal_relevance_search(question, k=3)

for i, doc in enumerate(docs):
    print(f"contents doc[{i}]: {doc.page_content=}")

contents doc[0]: doc.page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of 

Note that the duplication has been removed from the result set when applying the MMR technique.

#### Using the LLM-aided *SelfQuery* technique

When using the LLM-aided *SelfQuery* technique, we improve the quality of the result set by running the query through an LLM to split the semantic search piece and the metadata search piece from the user's question:

![SelfQuery](pics/llm-aided-retrieval-selfquery.png)

We already saw that by itself, the basic similarity search is not smart enough to differentiate metadata information from semantic info in the question:

In [15]:
question = "What did they say about regression in the third lecture?"

docs = vectordb.similarity_search(question, k=3)

for i, doc in enumerate(docs):
    print(f"results metadata[{i}]: {doc.metadata=}")

results metadata[0]: doc.metadata={'page': 0, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[1]: doc.metadata={'page': 2, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf'}
results metadata[2]: doc.metadata={'page': 14, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}


We see that results from the second lecture have been pulled too, which is not ideal.

The `similarity_search()` method lets us include the metadata portion of the query as a filter:

In [16]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf"}
)

for i, doc in enumerate(docs):
    print(f"results metadata[{i}]: {doc.metadata=}")

results metadata[0]: doc.metadata={'page': 0, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[1]: doc.metadata={'page': 14, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[2]: doc.metadata={'page': 4, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}


Note that now all the results come from the third lecture.
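
Conceptually, the filter narrows the candidate set by metadata, and only the surviving chunks are ranked by similarity. The following is a minimal sketch of that idea in plain Python, with a hypothetical `filtered_search` helper, a toy in-memory store, shortened file names, and made-up two-dimensional vectors. It is an illustration, not a description of how Chroma is implemented:

```python
def dot(a, b):
    """Sum of the element-wise products of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vector, store, k, metadata_filter=None):
    """Return the k most similar items whose metadata matches the filter."""
    metadata_filter = metadata_filter or {}
    # Keep only the items whose metadata matches every filter entry.
    candidates = [
        item for item in store
        if all(item["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    # Rank the surviving items by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda item: dot(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy store (hypothetical vectors and shortened source names).
store = [
    {"vector": [1.0, 0.0], "metadata": {"source": "lecture02.pdf", "page": 2}},
    {"vector": [0.9, 0.1], "metadata": {"source": "lecture03.pdf", "page": 0}},
    {"vector": [0.5, 0.5], "metadata": {"source": "lecture03.pdf", "page": 14}},
]
query_vector = [1.0, 0.0]

unfiltered = filtered_search(query_vector, store, k=2)
filtered = filtered_search(
    query_vector, store, k=2, metadata_filter={"source": "lecture03.pdf"}
)
print([item["metadata"] for item in unfiltered])
print([item["metadata"] for item in filtered])
```

Without the filter, the most similar item comes from the second lecture even though the question asks about the third one.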

By using *SelfQuery* we will be able to get the same results, but without having to write the filter ourselves.

For this to work, we have to make the language model aware of the nature of the document's metadata. This should be as descriptive as possible.

If you inspect the results metadata, you see that we only have two fields: `source` and `page`.

As a first step, we will need to create a metadata field descriptor like the one below.

In [6]:
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture from where the chunk is from, should be one of `data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf`, `data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf`, or `data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf`",
        type="string"
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer"
    )
]

Note that the description for the `source` metadata field includes the actual possible values:
+ `data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf`
+ `data/pdfs/cs229_lectures_MachineLearning-Lecture02.pdf`
+ `data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf`

Should the location of the files change, the description above should be updated too.

Now we're ready to apply the *SelfQuery* technique, which is implemented in LangChain via the `SelfQueryRetriever` object:

In [7]:
from langchain_openai import AzureOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

document_content_description = "Lecture notes"
llm = AzureOpenAI(
    deployment_name=os.getenv("AZURE_OPENAI_COMPLETION_DEPLOYMENT_NAME"),
    model_name="gpt-35-turbo-instruct",
    temperature=0
)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

With the retriever in place, we can start interrogating our vector database and see if the retrieved results effectively come from the third lecture:

In [8]:
question = "what did they say about regression in the third lecture?"

docs = retriever.get_relevant_documents(question)

print(f"Number of documents retrieved: {len(docs)}")
for i, doc in enumerate(docs):
    print(f"results metadata[{i}]: {doc.metadata=}")

Number of documents retrieved: 4
results metadata[0]: doc.metadata={'page': 14, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[1]: doc.metadata={'page': 10, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[2]: doc.metadata={'page': 0, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}
results metadata[3]: doc.metadata={'page': 10, 'source': 'data/pdfs/cs229_lectures_MachineLearning-Lecture03.pdf'}


It must be noted that only models supporting the *Completions API* will work for the *SelfQuery* technique. In the example above, we've used `"gpt-35-turbo-instruct"`.

#### Using the LLM-aided *Compression* technique

The *Compression* technique, also known as *Contextual Compression*, lets us get more focused results by removing extraneous information that might be present in the retrieved chunks.

![Compression technique](pics/llm-aided-retrieval-compression.png)

LangChain provides a `ContextualCompressionRetriever` object that simplifies the implementation of this technique:

In [5]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import AzureOpenAI

llm = AzureOpenAI(
    deployment_name="chat",
    model_name="gpt-35-turbo",
    temperature=0
)

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(),
)

With the compression retriever in place, we can ask a question and check that the results are more focused than when doing a basic similarity search.

First, we define a helper function to print the results:

In [13]:
def pretty_print_docs(docs):
    print(f"{len(docs)} document{'' if len(docs) == 1 else 's'} retrieved")
    separator = f"\n{'-' * 100}\n"  # line of dashes between documents
    print(separator.join(
        f"Document {i + 1} (length: {len(d.page_content)}):\n\n{d.page_content}"
        for i, d in enumerate(docs)))


In [9]:
question = "What did they say about MATLAB?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1 (length: 1046):

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will wo

By contrast, the results of the regular similarity search are typically longer and therefore less focused:

In [11]:
question = "What did they say about MATLAB?"
docs = vectordb.similarity_search(question)
pretty_print_docs(docs)

Document 1 (length: 1443):

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will wo

Note that in the compressed docs we're getting duplicated results.

Nothing prevents us from applying both *Contextual Compression* (to get more focused results) and *MMR* (to reduce duplication and get more diverse results):

In [9]:
compression_mmr_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

question = "What did they say about Matlab?"

compressed_diverse_docs = compression_mmr_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_diverse_docs)



Document 1 (length: 1046):

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will wo

Note that beyond responses #1 and #2 we're getting strange results.
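
To make the MMR idea more concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the toy 2-D vectors stand in for real embeddings, and the names `mmr_select` and `lambda_mult` are assumptions chosen for illustration. Each candidate is scored by its relevance to the query minus its similarity to what has already been selected, so near-duplicates are penalized:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents picked by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are duplicates, document 2 is different
toy_query = [1.0, 0.0]
toy_vectors = [[1.0, 0.1], [1.0, 0.1], [0.8, -0.6]]

top_by_similarity = sorted(
    range(len(toy_vectors)),
    key=lambda i: cosine(toy_query, toy_vectors[i]),
    reverse=True,
)[:2]
top_by_mmr = mmr_select(toy_query, toy_vectors, k=2)

print(top_by_similarity)  # the two duplicates
print(top_by_mmr)         # one of the duplicates plus the different document
```

Plain similarity returns both duplicates, while MMR swaps the second one for the less similar but more diverse document.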

#### Using other classical retrieval techniques

All the retrieval techniques we've seen so far are built on top of vector stores, but LangChain also supports retrievers based on more traditional Natural Language Processing (NLP) and Machine Learning (ML) techniques:

* SVM retriever
* TF-IDF retriever

The following snippets illustrate how to use these techniques:

In [11]:
from langchain_community.retrievers import SVMRetriever
from langchain_community.retrievers import TFIDFRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf")
pages = loader.load()
all_pages_text = [p.page_content for p in pages]
all_text_str = " ".join(all_pages_text)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_text(all_text_str)

svm_retriever = SVMRetriever.from_texts(splits, embeddings)
tfidf_retriever = TFIDFRetriever.from_texts(splits)


With the non-vectorstore based retrievers in place, we can start asking questions.
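
Before running them, it may help to see roughly what a TF-IDF based retriever computes. The following is a simplified sketch in plain Python, not the code LangChain runs (the real retriever is more sophisticated). The function name `tfidf_rank` and the toy splits are assumptions made for illustration. Terms that are frequent in a split but rare across the collection contribute most to the score:

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real tokenizers do more)."""
    return text.lower().split()


def tfidf_rank(query, texts):
    """Return the indices of texts, best match first, using a simple TF-IDF score."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


toy_splits = [
    "homeworks will be done in matlab or octave",
    "clustering can group an image into regions",
    "the class covers supervised learning",
]
ranking = tfidf_rank("what about matlab", toy_splits)
print(toy_splits[ranking[0]])
```

Note that this kind of retriever matches on literal terms, so a query and a split that share no words get a score of zero even if they are semantically related.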

Let's start with the SVM retriever:

In [14]:
question = "What did they say about Matlab?"

docs_svm = svm_retriever.get_relevant_documents(question)
pretty_print_docs(docs_svm)

4 documents retrieved
Document 1 (length: 1435):

don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everything.  
So actually I, well, so yeah, just a side comment for those of you that haven't seen 
MATLAB before I guess, once a colleague of mine at a different university, not at 
Stanford, actually teaches another machine l earning course. He's taught it for many years. 
So one day, he was in his office, and an old student of his from, lik e, ten years ago came 
into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed 



Similarly, using TF-IDF retriever:

In [15]:
question = "What did they say about Matlab?"

docs_tfidf = tfidf_retriever.get_relevant_documents(question)
pretty_print_docs(docs_tfidf)

4 documents retrieved
Document 1 (length: 1449):

Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a 
picture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and 
group the picture into regions. Let me actually blow that up so that you can see it more 
clearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, 
grouping the image into [inaudible] regions.  
And what Ashutosh and Min did was they then  applied the learning algorithm to say can 
we take this clustering and us e it to build a 3D model of the world? And so using the 
clustering, they then had a lear ning algorithm try to learn what the 3D structure of the 
world looks like so that they could come up with a 3D model that you can sort of fly 
through, okay? Although many people used to th ink it's not possible to take a single 
image and build a 3D model, but using a lear ning algorithm and that sort of clustering 
algorithm 

This concludes the exploration work we've done around "Step 4: Retrieval". We've seen several techniques, including LLM-aided ones such as *SelfQueryRetriever*, in which an LLM is used to come up with filters (possibly over nested metadata structures) that narrow down the information effectively.

## Step 5: Output

![Step 5: Output](pics/rag_step_5.png)

In this final step we take the documents that were retrieved as a result of the previous retrieval step, and the question written by the user, and pass them both to a language model and ask it to craft an answer to the question.

The following diagram is an annotated view of the process that identifies the different components:

![Question-Answer Detailed Workflow](pics/question_answer_detailed_wf.png)

As you can see, the default technique is to pass all the splits to the LLM in a single call. In order to do so, we have to make sure that the length of the information we pass to the LLM does not exceed the LLM's context window size.

Because the number of `Document` objects retrieved can be large and LLM context windows are limited, all the information might not fit in a single call to the language model.

Three of the most popular methods to deal with this problem are:

1. Map_reduce
2. Refine
3. Map_rerank
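
As a rough illustration of the problem, the sketch below estimates whether a set of splits fits in a single call. All the numbers are assumptions: the 4-characters-per-token ratio is only a crude heuristic, and the window size and the tokens reserved for the answer depend on the model you use. The helper names are hypothetical too:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (the chars-per-token ratio is an assumption)."""
    return len(text) // chars_per_token + 1


def fits_in_context(splits, question, context_window=4096, reserved_for_answer=512):
    """Check whether all splits plus the question fit in a single LLM call."""
    budget = context_window - reserved_for_answer
    used = estimate_tokens(question) + sum(estimate_tokens(s) for s in splits)
    return used <= budget


toy_question = "What did they say about MATLAB?"
few_splits = ["a" * 1500] * 4    # 4 splits of 1500 characters each
many_splits = ["a" * 1500] * 40  # 40 splits of 1500 characters each

few_fit = fits_in_context(few_splits, toy_question)
many_fit = fits_in_context(many_splits, toy_question)
print(few_fit, many_fit)
```

When a check like this fails, one of the three methods listed above is needed. For accurate counts you would use the tokenizer that matches your model instead of a character-based estimate.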

### Using the default technique for Output (Question-Answering)

The default technique assumes that the whole set of relevant results fits in the model's context window and can therefore be sent to the LLM in one shot. To use it, we first instantiate the LLM that will answer the question.

| NOTE: |
| :---- |
| The default technique for question-answering is called `"stuff"`. |

This requires setting an extra parameter, `temperature`, which we set to zero (`temperature=0`). This parameter controls the variability of the results: a low value makes the answers more deterministic and helps keep them factual rather than invented:

In [7]:
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="chat",
    model="gpt-35-turbo",
    temperature=0
)

Then we need to instantiate a `RetrievalQA` object to create a chain that first does a retrieval step to fetch relevant documents, then passes those documents into an LLM to generate a response.

A chain refers to a sequence of calls (whether to an LLM, a tool, or a data processing step).

In [17]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

With the `qa_chain` in place, we can invoke it, passing the user's original question in the `"query"` field, as shown below. The response is a dictionary, with the LLM response available under the `"result"` key:

In [18]:
question = "What are the major topics for this class?"

answer_from_llm = qa_chain({"query": question})

answer_from_llm["result"]

'The major topics for this class are machine learning and its various applications. The course may also cover statistics and algebra as refreshers, and there will be discussion sections to go over extensions of the material covered in the main lectures.'

With the code above we've demonstrated the first end-to-end flow:
1. Document Loading: We identified a few PDFs that we used as our knowledge base.
2. Splitting: We created chunks of the documents of a certain size using a text splitter.
3. Storage: We created the corresponding embedding vectors for those chunks and stored them in a vector store so that we could run a similarity search.
4. Retrieval: We got a question from the user, and interrogated the vector store to get a set of relevant splits (that is, a set of splits that have to do with the user's question).
5. Output: We sent the relevant splits along with the question to the LLM so that it could craft a proper answer to the user's question.

![End-to-End Question-Answer Flow](pics/rag_stages_hl.png)

But we can see that the chain hid a lot of processing under the hood to simplify the developer's experience.
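
To get a feel for what happens under the hood, here is a simplified sketch of the default flow in plain Python. It is not the `RetrievalQA` implementation: `fake_retriever` and `fake_llm` are hypothetical stand-ins for the vector store retriever and the chat model, and the prompt wording is an assumption:

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever: returns hard-coded splits."""
    return [
        "I also assume familiarity with basic probability and statistics.",
        "Homeworks will be done in either MATLAB or Octave.",
    ]


def fake_llm(prompt):
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(answer based on a prompt of {len(prompt)} characters)"


def stuff_qa(question, retriever, llm):
    """Retrieve the splits, stuff them all into one prompt, and call the LLM once."""
    splits = retriever(question)
    context = "\n\n".join(splits)
    prompt = (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nHelpful answer:"
    )
    return {"query": question, "result": llm(prompt), "source_documents": splits}


response = stuff_qa("Is probability a class topic?", fake_retriever, fake_llm)
print(response["result"])
```

The three hidden steps are retrieval, prompt construction, and a single LLM call. The prompt construction step is the one we have the most control over.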

Let's look at the parameters we can tune to influence the final results.

The most important configuration parameter we can use is the *prompt*.

The prompt is the element that takes in the relevant splits and the user's question and passes them to the language model.

Typically, you will define a prompt template which will be passed to the LLM. The prompt template will contain:
+ certain instructions about how to use the different pieces of the context. Context is the generic term for the information that results from the *"Retrieval"* step. In our case, the context is the set of relevant splits returned by a similarity search on the vector store when using the user's question as the query.

+ a placeholder for the context variable, such as `{context}`.

+ the user's question.

For example:

In [20]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "Thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful answer:
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

Note that the `{context}` placeholder is where the documents will go, and the `{question}` placeholder is where the question will go.
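
The substitution itself is plain string templating. As a minimal sketch (using Python's built-in `str.format` rather than `PromptTemplate`, with made-up splits, and assuming the splits are joined with blank lines):

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful answer:
"""

toy_context_splits = ["Split one text.", "Split two text."]

filled_prompt = toy_template.format(
    context="\n\n".join(toy_context_splits),  # all the splits, one after another
    question="Is probability a class topic?",
)
print(filled_prompt)
```

The resulting string is what the language model actually receives.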

With the new prompt in place, we can instantiate a new Question-Answering chain with the same model and vector db, but a tailored prompt.

In [21]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

And we can start getting answers using our tailored prompt:

In [22]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain({"query": question})

answer_from_llm["result"]

'Yes, probability is assumed to be a prerequisite for the class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!'

Another interesting parameter that we can use is `return_source_documents=True`. As you can imagine, this makes the chain return a `"source_documents"` key referencing the documents that were used to infer the answer:

In [24]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
    return_source_documents=True
)

question = "Is probability a class topic?"

answer_from_llm = qa_chain({"query": question})

print(answer_from_llm["result"])
pretty_print_docs(answer_from_llm["source_documents"])

Yes, probability is assumed to be a prerequisite for the class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!
4 documents retrieved
Document 1 (length: 1415):

of this class will not be very program ming intensive, although we will do some 
programming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  
I also assume familiarity with basic proba bility and statistics. So most undergraduate 
statistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna 
assume all of you know what ra ndom variables are, that all of you know what expectation 
is, what a variance or a random variable is. And in case of some of you, it's been a while 
since you've seen some of this material. At some of the discussion sections, we'll actually 
go over some of the prerequisites, sort of as  a refresher course under 

To reiterate, this demonstrates the default Output/Question-Answering technique that stuffs all the relevant splits into the prompt and invokes the LLM.

This is a good approach because it only involves one call to the language model, which is provided with the whole set of relevant information so that it can generate a proper answer. The drawback is that if there's too much information and the LLM context window is small, the relevant splits may not all fit in the context window, and we will be forced to use another technique.

### Reducing the context size: Using the Map-Reduce technique

The size of the prompt when invoking a language model is limited. Because of that, when the amount of relevant information that you encode in the context is large, you will need to apply a technique to make it smaller.

One such technique is Map-Reduce. When using this approach, each individual document is first sent to the language model on its own to get an individual answer.

Then, those answers are composed into a final answer with a final call to the language model.

Note that this technique will involve many more calls to the LLM than you would use with the default technique, but you will be able to operate over an arbitrarily large number of documents.

![Map-Reduce technique](pics/map_reduce.png)
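
The flow can be sketched in a few lines of plain Python. This is only an illustration of the idea, not LangChain's implementation: `counting_llm` is a hypothetical stub that numbers its answers so that we can count the calls, and the prompt wording is an assumption:

```python
def map_reduce_qa(question, splits, llm):
    """Answer the question per split (Map), then merge the answers (Reduce)."""
    # Map phase: one LLM call per split
    partial_answers = [
        llm(f"Context:\n{split}\n\nQuestion: {question}\nAnswer:")
        for split in splits
    ]
    # Reduce phase: one final LLM call over the individual answers
    summaries = "\n".join(partial_answers)
    return llm(
        f"Individual answers:\n{summaries}\n\nQuestion: {question}\nFinal answer:"
    )


llm_calls = []


def counting_llm(prompt):
    """Stub LLM that records every prompt it receives."""
    llm_calls.append(prompt)
    return f"answer #{len(llm_calls)}"


mr_answer = map_reduce_qa(
    "Is probability a class topic?",
    ["split 1", "split 2", "split 3"],
    counting_llm,
)
print(len(llm_calls))
```

With three splits there are four calls in total: three in the Map phase and one in the Reduce phase.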

LangChain provides a parameter `chain_type` in its `RetrievalQA` object to apply the Map-Reduce technique:

In [8]:
from langchain.chains import RetrievalQA

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

In [9]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_mr({"query": question})
answer_from_llm["result"]

'There is no clear answer to this question based on the given portion of the document. The document mentions familiarity with basic probability and statistics as a prerequisite for the class, and there is a brief mention of probability in the text, but it is not clear if it is a main topic of the class. The instructor mentions using a probabilistic interpretation to derive a learning algorithm, but does not go into further detail about probability as a topic.'

Because there are multiple LLM calls (one per relevant split, plus the final one), this technique is expected to be much slower than the default one.

Also, this technique doesn't ensure that you'll get a better answer, because the final answer is *an ensemble* of the individual answers derived from each split, which might not contain the portion of the information the user is looking for.

That is the case with the previous answer, which is quite verbose and somewhat circular.

Note that it is entirely possible to use the Map-Reduce technique and a custom prompt.

Start by defining your prompt for the *Map* phase:

In [None]:
from langchain.prompts import PromptTemplate

template = """What is the answer to the following question based on the provided context?
{context}
Question: {question}
Helpful answer:
"""

QUESTION_PROMPT = PromptTemplate.from_template(template)

Then, you can customize the prompt for the *Reduce* phase:

In [25]:
from langchain.prompts import PromptTemplate

template = """Use the following individual_relevant_answers to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "Thanks for asking!" at the end of the answer.
{individual_relevant_answers}
Question: {question}
Helpful answer:
"""

COMBINE_PROMPT = PromptTemplate.from_template(template)

In [26]:
from langchain.chains import RetrievalQA

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce",
    chain_type_kwargs={
        # Prompt applied to each split in the Map phase (defined above)
        "question_prompt": QUESTION_PROMPT,
        # Prompt that merges the individual answers in the Reduce phase
        "combine_prompt": COMBINE_PROMPT,
        # Placeholder in COMBINE_PROMPT that receives the individual answers
        "combine_document_variable_name": "individual_relevant_answers"
    }
)

In [27]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_mr({"query": question})
answer_from_llm["result"]

'It is mentioned in the context that the instructor will use probabilistic interpretation to derive the next learning algorithm, so it is likely that probability will be covered in the class. Thanks for asking!'

### Reducing the context size: Using the Refine technique

The Refine technique is another approach you can use when all the relevant information resulting from a similarity search won't fit in a single call to the LLM.

In this technique, each of the calls after the first one contains the answer resulting from the previous one:

![Refine technique](pics/refine.png)
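
As a minimal sketch of the idea in plain Python (not LangChain's implementation; `recording_llm` is a hypothetical stub and the prompt wording is an assumption), the loop below carries the running answer from one call to the next:

```python
def refine_qa(question, splits, llm):
    """Build an answer from the first split, then refine it with each remaining split."""
    # Assumes at least one split was retrieved
    answer = llm(f"Context:\n{splits[0]}\n\nQuestion: {question}\nAnswer:")
    for split in splits[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{split}\n"
            "Refined answer:"
        )
    return answer


refine_prompts = []


def recording_llm(prompt):
    """Stub LLM that records every prompt and returns a numbered draft."""
    refine_prompts.append(prompt)
    return f"draft {len(refine_prompts)}"


refined = refine_qa(
    "Is probability a class topic?",
    ["split 1", "split 2", "split 3"],
    recording_llm,
)
print(refined)
```

Unlike Map-Reduce, the calls cannot run independently of each other, because each one needs the answer produced by the previous one.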

LangChain supports this technique through `chain_type="refine"` in the `RetrievalQA.from_chain_type()` function:

In [28]:
from langchain.chains import RetrievalQA

qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)

In [30]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_refine({"query": question})
answer_from_llm["result"]

"The topic being discussed in the class is machine learning algorithms, specifically linear regression and classification. The instructor mentions that linear regression can be endowed with a probabilistic interpretation, which will be used to derive the next learning algorithm, the first classification algorithm. The instructor explains that classification problems involve predicting a discrete value, such as whether a patient has a disease or not, or whether a house will sell in the next six months or not. The instructor also mentions that the class will cover some statistics and algebra as a refresher in the discussion sections for those who need it. Additionally, the discussion sections will be used to cover extensions for the material that the instructor didn't have time to cover in the main lectures."

This is a better result than the Map-Reduce chain because information is chained sequentially, and there's less chance that relevant information is missed.

As in the other techniques, it is possible to customize the prompt:

In [36]:
from langchain.prompts import PromptTemplate

template = """The original question is as follows:
{question}

We have provided an existing answer, including sources (just the ones given in the metadata of the documents, don't make up your own sources):
{existing_answer}

We have the opportunity to refine the existing answer (only if needed) with some more context below:
{context_str}

Given the new context, add to the original answer to better answer the question. If you do update it, please update the sources as well. If the context isn't useful, print the original answer. The final answer should incorporate information from the original answer and the new context, but don't make use of phrases like 'additional context', 'original answer', 'new context', 'old answer' because we must hide this answer updation process from end users. Always say "Thanks for asking!" at the end of the answer.
"""

QA_CHAIN_REFINE_PROMPT = PromptTemplate(
    input_variables=["question", "existing_answer", "context_str"],
    template=template
)

In [37]:
from langchain.chains import RetrievalQA

qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine",
    chain_type_kwargs={"refine_prompt": QA_CHAIN_REFINE_PROMPT}
)

In [38]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_refine({"query": question})
answer_from_llm["result"]

"Yes, probability is a class topic that may involve some programming, mostly in MATLAB or Octave, but will not be very programming intensive. Students are expected to know what random variables, expectation, variance, and matrices and vectors are. If some students need a refresher, there will be review sections available. Most undergraduate statistics and linear algebra courses should provide sufficient background knowledge. The class may also cover probabilistic interpretation in order to derive learning algorithms, including classification algorithms. Classification problems involve predicting a discrete value, such as whether a patient has a disease or not, or whether a house will sell in the next six months or not. Additionally, the class will have discussion sections to go over extensions for the material that the professor is teaching in the main lectures. These extensions will cover some aspects of machine learning that the professor didn't have time to cover in the main lecture

### Reducing the context size: Using the Map-Rerank technique

The Map-Rerank technique is yet another approach you can use when all the relevant information resulting from a similarity search won't fit in a single call to the LLM.

In this technique, each relevant chunk is sent to the LLM on its own, and the LLM returns an answer together with a score. The answer with the highest score is selected:

![Map-Rerank technique](pics/map_rerank.png)
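
The following sketch illustrates the idea in plain Python. It is an illustration under assumptions rather than LangChain's implementation: `canned_llm` is a hypothetical stub that returns hard-coded answers in an `answer` + `Score: N` format, and the standard `re` module is used to extract the score:

```python
import re

# Captures the answer text and the numeric score that follows it
SCORE_PATTERN = re.compile(r"(.*?)\nScore: (\d*)")


def map_rerank_qa(question, splits, llm):
    """Ask the LLM about each split separately and keep the best-scored answer."""
    scored_answers = []
    for split in splits:
        raw = llm(f"Context:\n{split}\n\nQuestion: {question}")
        match = SCORE_PATTERN.search(raw)
        if match:
            score = int(match.group(2) or 0)
            scored_answers.append((score, match.group(1).strip()))
    if not scored_answers:
        return None
    return max(scored_answers, key=lambda pair: pair[0])[1]


def canned_llm(prompt):
    """Stub LLM: gives a high score only when the context mentions probability."""
    context_part = prompt.split("\n\nQuestion:")[0]
    if "probability" in context_part:
        return "Yes, it is a prerequisite.\nScore: 90"
    return "This document does not answer the question\nScore: 0"


best_answer = map_rerank_qa(
    "Is probability a class topic?",
    [
        "Homeworks will be done in MATLAB or Octave.",
        "I assume familiarity with basic probability.",
    ],
    canned_llm,
)
print(best_answer)
```

Note that the answers are never merged: only the single best-scored answer survives, so information spread over several splits can be lost.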

You can easily apply this technique in LangChain using `chain_type="map_rerank"` in the `RetrievalQA.from_chain_type()` function:

In [39]:
from langchain.chains import RetrievalQA

qa_chain_map_rerank = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_rerank"
)

In [40]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_map_rerank({"query": question})
answer_from_llm["result"]



'Yes, probability is assumed to be a prerequisite for the class.'

As in the techniques already seen, you can customize the prompt, although it's a bit more involved than in the other cases because you'll need to manage the scoring capabilities.

Let's start with the scoring functionality.

This requires a `RegexParser` that extracts the answer, stored under the `"answer"` key, and the score, stored under the `"score"` key, from the LLM output.

The answer with the highest score is the one that will be returned.

In [46]:
from langchain.output_parsers.regex import RegexParser

output_parser = RegexParser(
    regex=r"(.*?)\nScore: (\d*)",
    output_keys=["answer", "score"]
)

Then we can define the prompt, which should be sufficiently detailed so that it is clear how the ranking should be performed:

In [47]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

In addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:
Question: [question here]
Helpful Answer: [answer here]
Score: [score between 0 and 100]

How to determine the score:
- Higher is a better answer
- Better responds fully to the asked question, with sufficient level of detail
- If you do not know the answer based on the context, that should be a score of 0
- Don't be overconfident!

Example #1

Context:
---------
Apples are red
---------
Question: what color are apples?
Helpful Answer: red
Score: 100

Example #2

Context:
---------
it was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv
---------
Question: what type was the car?
Helpful Answer: a sports car or an suv
Score: 60

Example #3

Context:
---------
Pears are either red or orange
---------
Question: what color are apples?
Helpful Answer: This document does not answer the question
Score: 0

Begin!

Context:
---------
{context}
---------
Question: {question}
Helpful Answer:
"""

MAP_RERANK_CHAIN_PROMPT = PromptTemplate.from_template(
    template,
    output_parser=output_parser
)

In [49]:
from langchain.chains import RetrievalQA

qa_chain_map_rerank = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_rerank",
    chain_type_kwargs={"prompt": MAP_RERANK_CHAIN_PROMPT}
)

In [52]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain_map_rerank({"query": question})
answer_from_llm["result"]



'Yes, probability is assumed to be a prerequisite for the class.'

## Creating a chatbot

In the previous section, we went through all the steps of Retrieval Augmented Generation (RAG) and developed some intuition around the concepts and techniques used in each step.

Now, we can start thinking about implementing an application that uses RAG as one part of the solution, for example a chatbot.

In this section we will see what else is needed to implement such an application. This will help you understand the additional challenges you might face and learn some additional techniques.

The main characteristic we're missing from a chatbot application is the management of the state.

Let's understand what we mean by that by asking a follow-up question to the LLM and checking that it doesn't work as expected:

In [8]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

Let's ask a question, and immediately after ask for a clarification:

In [9]:
question = "Is probability a class topic?"

answer_from_llm = qa_chain({"query": question})
answer_from_llm["result"]

  warn_deprecated(


'Yes, probability is a topic in this class. The instructor assumes familiarity with basic probability and statistics, and mentions that most undergraduate statistics classes will be more than enough preparation.'

In [10]:
question = "Why are those prerequisites needed?"

answer_from_llm = qa_chain({"query": question})
answer_from_llm["result"]

'The prerequisites are needed because the class assumes that all students have a basic knowledge of computer science and computer skills and principles. This includes knowledge of big-O notation and other basic concepts. Without this basic knowledge, it may be difficult for students to understand the material covered in the class.'

Note that the answer elaborates on the lecture prerequisites instead of taking into account that we were interested in the prerequisites that had to do with probability.

This happens because the chain doesn't yet have the concept of state.

The following diagram illustrates what we want to achieve:

![chatbot](pics/chatbot.png)

We start by making sure our vector store is correctly configured with our knowledge base, and we run a basic scenario to confirm that chat history is not taken into account by default. Then we add memory to the chain and check that it works.

First, we replace the standard library `sqlite3` module with the more modern `pysqlite3`:

In [7]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

Then we retrieve our knowledge base from the file system:

In [8]:
import os

from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import Chroma

persist_path = "./chroma_data/"

embeddings = AzureOpenAIEmbeddings(
    deployment=os.getenv("AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME")
)

vectordb = Chroma(persist_directory=persist_path, embedding_function=embeddings)

And we check we have the documents loaded:

In [9]:
print(vectordb._collection.count())
assert vectordb._collection.count() == 209

209


Let's now run a basic similarity search as a sort of shakedown test:

In [10]:
question = "What are the major topics for this class?"
docs = vectordb.similarity_search(question, k=3)
len(docs)

3

Now we instantiate our `llm` object that will allow us to interact with the underlying LLM.

Note that because we're in *chat mode* we use the `AzureChatOpenAI` class. This will make the model behave like a chatbot.

In [34]:
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    deployment_name="chat",
    model_name="gpt-35-turbo",
    temperature=0
)

For example, we can start chatting with the model without taking into account our knowledge base using the `predict` method:

In [35]:
llm.predict("Hello, world!")

'Hello there! How can I assist you today?'

Now we need to add some memory to it, as depicted in our blueprint for the chatbot:

![Chatbot](pics/chatbot.png) 

LangChain provides a `ConversationBufferMemory` object that can be used to implement our chat history.

That object will be in charge of keeping a list of the previous chat messages, and will pass them along with the question to the chatbot every time.
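
Conceptually, a buffer memory is little more than a list that grows with every turn. The class below is a hypothetical, simplified stand-in written for illustration, not the actual `ConversationBufferMemory` API:

```python
class SimpleBufferMemory:
    """Keeps the full chat history as a list of (role, content) pairs."""

    def __init__(self):
        self.messages = []

    def save_turn(self, question, answer):
        """Store one question/answer exchange."""
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def load(self):
        """Return a copy of the history, oldest message first."""
        return list(self.messages)


toy_memory = SimpleBufferMemory()
toy_memory.save_turn("Is probability a class topic?", "Yes, it is a prerequisite.")
toy_memory.save_turn("Why are those prerequisites needed?", "The algorithms build on them.")

toy_history = toy_memory.load()
print(len(toy_history))
```

Because the whole history is sent along with every new question, a buffer like this grows without bound, which is worth keeping in mind for long conversations.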

The object can be instantiated by passing the `memory_key` that will identify the placeholder in the prompt. We will also set `return_messages=True` to instruct the object to return the history as a list of messages rather than a single string:

In [36]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

For our chatbot, we will need to use a different chain from the usual `RetrievalQA` we've been using.

The `ConversationalRetrievalChain` will allow us to pass our recently created memory object:

In [37]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=vectordb.as_retriever(),
    memory=memory
)

The `ConversationalRetrievalChain` takes the history and the new question and condenses them into a standalone question that is passed to the vector store to look up relevant documents.
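
The condensing step can be sketched as follows. This is a simplified illustration, not the chain's actual code: the prompt wording, the helper names, and `rewriting_llm` (a stub that returns a hard-coded rewrite) are assumptions, as is the shortcut of skipping the rewrite when there is no history yet:

```python
def build_condense_prompt(chat_history, follow_up):
    """Format the history and the follow-up question into a rewrite request."""
    history_lines = "\n".join(f"{role}: {content}" for role, content in chat_history)
    return (
        "Given the following conversation and a follow-up question, "
        "rephrase the follow-up question to be a standalone question.\n\n"
        f"Chat history:\n{history_lines}\n\n"
        f"Follow-up question: {follow_up}\nStandalone question:"
    )


def condense_question(chat_history, follow_up, llm):
    """Return the question that would be sent to the retriever."""
    if not chat_history:
        # Nothing to condense on the first turn
        return follow_up
    return llm(build_condense_prompt(chat_history, follow_up))


def rewriting_llm(prompt):
    """Stub LLM that returns a hard-coded standalone question."""
    return "Why are probability and statistics prerequisites for the class?"


previous_turns = [
    ("human", "Is probability a class topic?"),
    ("ai", "Yes, it is a prerequisite."),
]
standalone = condense_question(previous_turns, "Why are those prerequisites needed?", rewriting_llm)
first_turn = condense_question([], "Is probability a class topic?", rewriting_llm)
print(standalone)
```

The retriever then searches with the standalone question, which mentions probability explicitly, instead of the ambiguous "those prerequisites".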

Let's see it in action:

In [38]:
question = "Is probability a class topic?"

llm_answer = qa({"question": question})
llm_answer

{'question': 'Is probability a class topic?',
 'chat_history': [HumanMessage(content='Is probability a class topic?'),
  AIMessage(content='Yes, probability is a topic assumed to be familiar to students in this class. The instructor mentions that familiarity with basic probability and statistics is assumed, and that most undergraduate statistics classes would be sufficient preparation.')],
 'answer': 'Yes, probability is a topic assumed to be familiar to students in this class. The instructor mentions that familiarity with basic probability and statistics is assumed, and that most undergraduate statistics classes would be sufficient preparation.'}

Now, we can ask a follow-up question:

In [39]:
question = "Why are those prerequisites needed?"
llm_result = qa({"question": question})
llm_result["answer"]

'The reason for requiring familiarity with basic probability and statistics as prerequisites for this class is that the class assumes that students already know what random variables are, what expectation is, what a variance or a random variable is. The class also assumes that students are familiar with basic linear algebra, such as knowing what matrices and vectors are, how to multiply matrices and vectors, and what a matrix inverse is. These concepts are fundamental to understanding the material covered in the class.'

This time, the language model takes our previous conversation into account to generate the answer.

The final thing we can do is wrap everything into a function that bundles the steps needed to implement a chatbot that talks to your documents:

In [42]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_openai import AzureChatOpenAI
from langchain.chains import RetrievalQA


load_dotenv()


def load_db(file, k=3):
    # Load documents from file
    loader = PyPDFLoader(file)
    documents = loader.load()

    # Generate splits
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150
    )
    splits = text_splitter.split_documents(documents)

    # Create the embeddings function
    embeddings = AzureOpenAIEmbeddings(
        deployment=os.getenv("AZURE_OPENAI_TEXT_EMBEDDING_DEPLOYMENT_NAME")
    )

    # Get an in-memory vector db with the existing splits and embedding function
    db = DocArrayInMemorySearch.from_documents(splits, embeddings)

    # Instantiate the retriever, returning the top-k splits per query
    retriever = db.as_retriever(search_kwargs={"k": k})

    # Instantiate the LLM
    llm = AzureChatOpenAI(
        deployment_name="chat",
        model_name="gpt-35-turbo",
        temperature=0
    )

    # Return a question-answering chain (stateless; any chat history must be managed externally)
    qa = RetrievalQA.from_llm(
        llm,
        retriever=retriever,
        return_source_documents=True
    )
    return qa


Now we can start using it with any custom document:

In [46]:
qa = load_db("data/pdfs/cs229_lectures_MachineLearning-Lecture01.pdf")

question = "Who is giving the lecture?"
llm_answer = qa({"query": question})
llm_answer["result"]

'The lecture is being given by Andrew Ng.'