# Lab 3: Retrieval Augmented Generation

## Overview

Now that we understand tokenization and vectorization and can interact with our local model, we will turn our attention to leveraging that model to obtain useful responses. To do this, we need to understand how to preprocess our data effectively and make use of a vector database. In this lab we will do just that: preprocess input data into chunks, vectorize the chunks, store the vectors and metadata in a vector database, and then use that database to build useful input for the LLM, with the hope of generating a useful and meaningful response.

## Goals
By the end of this lab you should know how or be able to:

 * Preprocess PDF documents.
 * Generate text embeddings for the chunks.
 * Store the embeddings into a vector database along with useful metadata.
 * Query the vector database for stored vectors nearest a user question.
 * Create a simple but effective prompt that provides guardrails on the model's response and helps keep the response free of "hallucinations."

## Estimated Time: 60 minutes

# On Fine Tuning

There are a number of libraries and a large number of discussions available on the finetuning of language models. Finetuning the model can be accomplished in two main ways:

 * Train the entire model.
 * Freeze the model, add adapter layers and train the adapter layers only.

## Finetuning the Entire Model

Tuning the entire model seems very cool but has a number of drawbacks. First, you must have sufficient compute and memory capability to load the entire model and at least one batch of training data. Given the sizes of even the smallest of LLMs, you would either need to purchase at least one NVIDIA A100 40 GB GPU for something around 8,000 dollars, or rent time on a cloud based GPU for more than 2 dollars per hour (that is for a 40 GB A100; a single cloud based 80 GB H100 rents for nearly 5 dollars per hour). At those rates, roughly a month of continuous use comes to between 1,500 and 3,700 dollars for the GPU rental alone, which does not include instance, IO, storage, and other costs. Still, if you only need to fine tune the model once, this might make good financial sense. Unfortunately, this is only the first drawback.

Not only is finetuning the entire model costly, it is also time consuming. Fine tuning will typically take hours at a minimum, and far more likely days or even many days. If you are attempting to fine tune a model on consumer grade hardware, you would likely choose to go the route of LoRA (Low Rank Adaptation) or QLoRA (Quantized Low Rank Adaptation), which are strategies for vastly reducing the memory required to fine tune a model. In LoRA, the pretrained weight matrices are left untouched and the change to each targeted matrix is learned as the product of two much smaller "low rank" matrices (the adapter weights added to the model to accommodate fine tuning), so only a small fraction of the parameters is trained. QLoRA goes a step further, additionally storing the frozen pretrained weights with fewer bits of precision (quantization). Essentially, rather than representing a floating point value as something like $3.981237871$, a quantized variant (think "fewer bits") might store $3.981$. While this takes far less memory, we've also lost a significant number of digits, so some accuracy is sacrificed.
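
To get a feel for why adapter based approaches need so much less memory to train, the sketch below counts trainable parameters for one weight matrix updated directly versus through a pair of low rank matrices. The dimensions (`d = k = 4096`) and the rank (`r = 8`) are assumptions chosen purely for illustration; they are not taken from any particular model.

```python
# Illustrative arithmetic only; d, k, and r are assumed values.
d, k = 4096, 4096   # shape of one hypothetical weight matrix W (d x k)
r = 8               # assumed adapter rank

full_params = d * k          # parameters trained if W itself is updated
lora_params = r * (d + k)    # parameters in B (d x r) plus A (r x k)

fraction = lora_params / full_params
print(f"Full update:      {full_params:,} trainable parameters")
print(f"Low rank update:  {lora_params:,} trainable parameters")
print(f"Fraction trained: {fraction:.4%}")
```

Under these assumptions the low rank pair holds well under one percent of the parameters of the full matrix, which is where most of the training memory savings come from.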

Another drawback is that if we choose to allow the entire model to be finetuned, we run the very real risk of *catastrophic forgetting*. What this means is that our training data might end up adjusting the weights within the pretrained model in such a way that it overwrites what the model learned during pretraining, creating a sort of "domino effect" of cascading failures and, in the worst case, a model that is effectively unusable.

To prevent this from happening, it is more common to *freeze* the pretrained model (prevent any of the trained parameters in the LLM from being updated during training) and add "adapter" layers to it, then train the adapter layers. While this is not truly accurate, you can think about it as adding a couple of layers that take care of nudging the LLM's output closer to something matching our training data or how we would like it to respond. For example, perhaps we would like the LLM to respond to the question, "Who created you?" with our own company name rather than something like "Meta." If we were to include that type of question in our fine tuning data, the adapter layers would learn to steer the answer, sort of filtering it, toward the answer we desire.

The final drawback that, at least to me, is very significant is that in all of these cases the model is still prone to the ill-termed behavior of *hallucination.* The term overly anthropomorphizes the model, creating the impression that it is thinking. Since the model truly is just predicting the next most likely token based on the preceding context, in reality it is simply generating a bad and likely false response. Often people will try hard to compensate for this with long and arduous prompt engineering, but there is a better way.

> We will include a solution notebook that demonstrates how to do finetuning should you wish to take that path, but we do not intend to cover finetuning in any additional detail.

## RAG over Finetuning

Even with finetuning, we have no real way to prevent the model from occasionally generating answers that are just wrong. Still, how can we get the model to give answers based on our data or some specific set of data? *Retrieval Augmented Generation*, or RAG, is the answer. Using this approach with some very simple prompt engineering we can leverage the model for what it is good at (text summarization) and word or sentence embeddings for what they are good at (locating thoughts or phrases that are similar to or related to the question posed). Let's dive in.


# <img src="../images/task.png" width=20 height=20> Task 3.1

We will need various libraries to complete the lab ahead. Specifically, we need to import:

* `json`, a Python library for encoding and decoding JSON objects.
* `requests`, a Python library for HTTP based interactions.
* `SentenceTransformer` from `sentence_transformers`, a Python library from Hugging Face that implements SBERT (Sentence Bidirectional Encoder Representations from Transformers). We will use this class to generate our embeddings.
* `PdfReader` from `pypdf`, a PDF parsing and processing library.
* `RecursiveCharacterTextSplitter` from `langchain_text_splitters`, a comprehensive library for building text processing pipelines for LLMs.
* `Path` from `pathlib`. The `pathlib` library is a modern set of methods and classes for manipulating file paths.

Please import these libraries using the next cell:

In [1]:
import requests
import json
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path


# <img src="../images/task.png" width=20 height=20> Task 3.2

In our first lab we learned about word embeddings and how they are trained. We understand that these embeddings are developed by attempting to predict either the context of a word or the word based on the context. These embeddings can then be used in the place of the word and carry some type of semantic meaning.

What we need to do is extend this notion of a word embedding to something larger, like a sentence or a paragraph. Think about why this is the case. Using word embeddings, we saw that it was possible to do arithmetic using the embedding vectors to find other words. What if we were able to generate a vector that captures important characteristics about sentences and paragraphs? Shouldn't it be possible to use some kind of math to find other vectors that are close to that chunk of text? Yes!

There are a number of different approaches used to generate embeddings for larger pieces of text. The simplest of these might do something like taking the arithmetic mean (average) of all the vectors for all the words that make up that chunk. While this does work to a degree, as you can imagine it's not the best way to capture the overall meaning of a sentence.
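
As a small illustration of why simple averaging falls short, the sketch below averages made-up word vectors for two sentences that use the same words in a different order. The three-dimensional vectors and the `mean_embedding()` helper are hypothetical and exist only for this example; real word embeddings have far more dimensions.

```python
import math

# Toy 3-dimensional "word embeddings"; the values are made up for illustration.
word_vectors = {
    "dog":   [0.75, 0.25, 0.0],
    "bites": [0.0, 0.5, 0.25],
    "man":   [0.25, 0.125, 0.875],
}

def mean_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    vectors = [word_vectors[word] for word in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)]

first = mean_embedding("dog bites man")
second = mean_embedding("man bites dog")

# Averaging ignores word order, so both sentences collapse to the same vector.
same_vector = all(math.isclose(a, b) for a, b in zip(first, second))
print(first)
print(same_vector)
```

The two sentences mean very different things, yet their averaged vectors are indistinguishable, because the average throws away word order.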

We will use a pre-trained set of embeddings known as SBERT (Sentence Bidirectional Encoder Representations from Transformers). While the training of SBERT embeddings is a bit more complicated than the word embeddings that we worked with, the overall notion is exactly the same. Since these embeddings are generated from chunks of text rather than words, they also tend to be much better than taking a simple average of word embeddings.

The `sentence-transformers` library makes a number of different SBERT models available:

| Model Name | Size |
|------------|--------|
| all-mpnet-base-v2 | 420 MB | 
| multi-qa-mpnet-base-dot-v1 | 420 MB | 
| all-distilroberta-v1 | 290 MB | 
| all-MiniLM-L12-v2 | 120 MB | 
| multi-qa-distilbert-cos-v1 | 250 MB | 
| all-MiniLM-L6-v2 | 80 MB | 
| multi-qa-MiniLM-L6-cos-v1 | 80 MB | 
| paraphrase-multilingual-mpnet-base-v2 | 970 MB | 
| paraphrase-albert-small-v2 | 43 MB | 
| paraphrase-multilingual-MiniLM-L12-v2 | 420 MB | 
| paraphrase-MiniLM-L3-v2 | 61 MB | 
| distiluse-base-multilingual-cased-v1 | 480 MB | 
| distiluse-base-multilingual-cased-v2 | 480 MB | 

For our class, we will make use of the `multi-qa-distilbert-cos-v1` embeddings. This choice is a middle-of-the-road choice in terms of memory requirements, and it is a model trained using question and answer pairs (that's what the `qa` part of the name means). What about the `cos` part? This stands for Cosine and indicates that the vectors in this model have a very specific property; they are all scaled to be of length 1 in the embedding space.

What does this mean for us? It tells us that this model is intended to be used for text similarity searches such as Cosine Similarity (imagine this to mean, "Are these two vectors pointing in the same direction?"), Euclidean Distance (typically used in K-Nearest Neighbors types of searches), and Dot Product (a Linear Algebra operation that, for vectors of length 1, gives exactly the same value as Cosine Similarity). Are there other types of embeddings available? Yes. The most common "other" form is the `dot` embeddings. Because those vectors are not scaled to length 1, they work with the Dot Product similarity test but are not as useful for Euclidean distance types of search or Cosine Similarity, which is a very typical way to search for similar vectors. While you can choose to use the `-dot-` versions of the model, our experience is that the `-cos-` models function better with various similarity searches.
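
The relationship between these three measures for length 1 vectors can be checked with a few lines of plain Python. This is a minimal sketch using two arbitrary toy vectors, not output from the model; the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def normalize(a):
    """Scale a vector to length 1."""
    length = norm(a)
    return [x / length for x in a]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two arbitrary toy vectors, scaled to length 1 the way a -cos- model's output is.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit vectors the dot product equals the cosine similarity...
print(math.isclose(dot(u, v), cosine(u, v)))
# ...and the squared Euclidean distance is 2 - 2 * (dot product).
print(math.isclose(euclidean(u, v) ** 2, 2 - 2 * dot(u, v)))
```

Because the squared distance is just a decreasing function of the dot product, all three measures rank neighbors in the same order when every vector has length 1.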

Using the next cell, instantiate a `SentenceTransformer()` object, passing in `'sentence-transformers/multi-qa-distilbert-cos-v1'` as an argument. This will download and initialize the DistilBERT SBERT model for our use. Assign this model to a variable named `model`.


In [2]:
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')



# <img src="../images/task.png" width=20 height=20> Task 3.3

Let's experiment with the model so that we have an understanding of what it does. Using the next cell, create a list of strings named `strings`. Assign to this list the following strings:

 * The sky is blue today.
 * Machine learning uses mathematics to find patterns in data.
 * Neural networks are trained using backpropagation.
 * Backpropagation is the training method used to update neurons.

Use the `model.encode()` method to encode the first string. Print out the embedding that is returned and the `shape` of that embedding (i.e., the `.shape` attribute of the value returned from the call to `encode()`).

In [3]:
strings = [
    "The sky is blue today.",
    "Machine learning uses mathematics to find patterns in data.",
    "Neural networks are trained using backpropagation.",
    "Backpropagation is the training method used to update neurons."
]

model.encode(strings[0]), model.encode(strings[0]).shape

(array([ 2.81845890e-02, -1.67494025e-02,  4.93362406e-03,  2.52056643e-02,
         9.95677989e-03,  4.09304202e-02, -9.28256835e-04,  1.05975002e-01,
        -6.20508417e-02, -4.82722744e-02, -1.86031237e-02, -5.28222397e-02,
        -5.58008887e-02,  6.68877363e-02, -1.46532189e-02,  5.65091111e-02,
        -2.10039206e-02, -2.24468708e-02,  2.78438423e-02, -7.09581189e-03,
         6.86126016e-03,  1.24178827e-02, -8.08446016e-03, -1.17118619e-02,
         3.57014760e-02, -3.37228775e-02, -7.50888288e-02,  3.78814787e-02,
         1.08273495e-02,  2.81119402e-02, -3.63185816e-02, -2.31127851e-02,
         2.02871747e-02,  2.48177797e-02, -8.47086590e-03, -1.15900170e-02,
        -6.84459135e-02, -1.25005227e-02, -3.09792329e-02,  1.63190458e-02,
         3.28300633e-02, -3.57349366e-02,  6.98456764e-02, -5.29162064e-02,
         1.30857173e-02,  8.31280928e-03,  1.88670587e-02,  5.31996116e-02,
         5.06243445e-02, -2.37999689e-02,  1.17153041e-02,  1.05823968e-02,
        -1.5

Hopefully you can see that the vector embedding looks very much like (even better, is identical to) the one in the solution. You can also see that each of these vectors has 768 dimensions; in other words, each is a point in $\mathbb{R}^{768}$.

# <img src="../images/task.png" width=20 height=20> Task 3.4

The model has a `similarity()` method associated with it. For this model, its help text shows that it is the `dot_score()` function:

> ```
> Help on function dot_score in module sentence_transformers.util:
>
> dot_score(a: 'list | np.ndarray | Tensor', b: 'list | np.ndarray | Tensor') -> 'Tensor'
>     Computes the dot-product dot_prod(a[i], b[j]) for all i and j.
>
>     Args:
>         a (Union[list, np.ndarray, Tensor]): The first tensor.
>         b (Union[list, np.ndarray, Tensor]): The second tensor.
>
>     Returns:
>         Tensor: Matrix with res[i][j] = dot_prod(a[i], b[j])
> ```

This function leverages the Linear Algebra dot product operation to determine a "score" for how similar two vectors are as a single value. The greater this value is, the more similar, or related, the two sentences are. A score close to zero implies the two vectors are not related. A negative value indicates that they are negatively related. In other words, they are talking about or "moving" in opposite directions.
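
To see how the sign of the score behaves, here is a minimal pure Python sketch of the same `res[i][j]` layout. The `dot_scores()` helper and the hand-picked two-dimensional unit vectors are hypothetical stand-ins for illustration; they are not the library's implementation.

```python
def dot_scores(a, b):
    """Return res where res[i][j] is the dot product of a[i] and b[j]."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

# Toy 2-dimensional unit vectors chosen by hand for illustration.
vectors = [
    [1.0, 0.0],    # reference direction
    [0.6, 0.8],    # roughly the same direction as the reference
    [0.0, 1.0],    # perpendicular to the reference (unrelated)
    [-1.0, 0.0],   # opposite direction to the reference
]

scores = dot_scores(vectors, vectors)
for row in scores:
    print([round(value, 2) for value in row])
```

Reading across the first row, the score against the reference direction drops from positive, to zero, to negative as the other vector swings from similar, to perpendicular, to opposite.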

If you think about the phrases that we have embedded, the first sentence has nothing to do with the other three. The second and third, while not expressing the same idea, are definitely related. The last two are saying pretty much the same thing in different ways.

Use the next cell to generate and print out the `similarity()` score for each pair of sentences. Which pair has the greatest similarity value?

In [4]:
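from itertools import combinations

def similarity(a, b):
    """Print the similarity score for one pair of sentences."""
    print(f'{model.similarity(model.encode(a), model.encode(b))}\t{a}, {b}')

# Score every unordered pair of sentences exactly once.
for a, b in combinations(strings, 2):
    similarity(a, b)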

tensor([[-0.0151]])	The sky is blue today., Machine learning uses mathematics to find patterns in data.
tensor([[-0.0273]])	The sky is blue today., Neural networks are trained using backpropagation.
tensor([[-0.0705]])	The sky is blue today., Backpropagation is the training method used to update neurons.
tensor([[0.2563]])	Machine learning uses mathematics to find patterns in data., Neural networks are trained using backpropagation.
tensor([[0.1926]])	Machine learning uses mathematics to find patterns in data., Backpropagation is the training method used to update neurons.
tensor([[0.7777]])	Neural networks are trained using backpropagation., Backpropagation is the training method used to update neurons.


# <img src="../images/task.png" width=20 height=20> Task 3.5

That's pretty cool! You can see that the first sentence, while slightly negative in relation to the others, is very close to zero, or "not related." The second sentence has a stronger relationship with the last two, which is correct. Finally, the last two sentences, which are both saying the same thing in different ways, have a very strong match at roughly 0.78.

To make use of this, then, we need to take some text (or corpus) of interest and convert it into chunks of text that we can then convert into embeddings. If we can accomplish this, we could store the embeddings somewhere and then leverage them to find related ideas in the corpus of text that we have processed. Let's do exactly this.

There are many types of text data you may want to process. We're going to take a look at a few of the more common types of source data in this class. The first of these, and our focus in this lab, will be PDF files.

To read in our PDF data, we are going to use the `PdfReader()` class from `pypdf`. This class provides a very convenient interface for processing and distilling the text in a PDF file. The files that we are interested in loading are located in the relative path `'../data/source_docs/'`. Rather than pre-configuring a list of files to load, why don't we dynamically determine the names of the files in that directory? That gives us the beginnings of a sort of "pipeline" that takes source text to vector embeddings.

Use the following cell to retrieve the list of files in the source directory. Store this list in a variable named `files`. You might use `list(Path('../data/source_docs/').glob('*.pdf'))` to obtain the list of PDF files.


In [5]:
path = Path('../data/source_docs/')
files = list(path.glob('*.pdf'))
files

[PosixPath('../data/source_docs/NIST.SP.800-12r1.pdf'),
 PosixPath('../data/source_docs/Incident_Handling.pdf'),
 PosixPath('../data/source_docs/DEV543.pdf'),
 PosixPath('../data/source_docs/NIST.SP.800-53r5.pdf')]

# <img src="../images/task.png" width=20 height=20> Task 3.6

Let's just focus on the *NIST SP 800-53* file for now.

The `PdfReader()` class accepts a file path to a PDF file as an argument. It will read in and process this file, allowing us to interrogate it using the object that we create. 

Identify at which offset the "NIST.SP.800-53r5.pdf" file is found in the list of files. In the following cell, use `PdfReader()` to process this file and assign the output to a variable named `document`.

**Note:** *Do not be concerned if you receive messages stating "Ignoring wrong pointing object..." if you choose to process some of the other files.*

In [6]:
document = PdfReader(files[3])
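
The order in which `glob()` returns files can differ between systems, so a hard-coded offset may not point at the same file everywhere. As a hedged alternative, the sketch below looks the offset up by file name. The `example_files` list is a hypothetical stand-in for `files` so that the snippet runs on its own.

```python
from pathlib import Path

# Hypothetical stand-in for the `files` list; in the lab it comes from glob().
example_files = [
    Path('../data/source_docs/NIST.SP.800-12r1.pdf'),
    Path('../data/source_docs/Incident_Handling.pdf'),
    Path('../data/source_docs/DEV543.pdf'),
    Path('../data/source_docs/NIST.SP.800-53r5.pdf'),
]

target_name = 'NIST.SP.800-53r5.pdf'

# Find the offset by file name instead of assuming a fixed position.
offset = next(i for i, path in enumerate(example_files) if path.name == target_name)
print(offset, example_files[offset])
```

In the notebook, the same lookup against `files` would give an offset that can be passed to `PdfReader()` regardless of listing order.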

# <img src="../images/task.png" width=20 height=20> Task 3.7

The document is represented in memory as a series of page objects. Let's do two things:

 * Please print out the `len()` of `document.pages`. This will tell us how many pages there are.
 * Print out `str(document.pages[10])[:1000]` (i.e., the first 1,000 characters of that page object when viewed as a string).

There's nothing special about the page at index 10; I just want us to skip somewhere into the body of the document rather than the title page to get a look at how the pages are represented.

In [7]:
print(f'This document has {len(document.pages)} pages.')
str(document.pages[10])[:1000]

This document has 492 pages.


"{'/Contents': IndirectObject(25, 0, 139822345706256), '/CropBox': [0.0, 0.0, 612, 792], '/Group': IndirectObject(2069, 0, 139822345706256), '/MediaBox': [0.0, 0.0, 612, 792], '/Parent': IndirectObject(60119, 0, 139822345706256), '/Resources': {'/ColorSpace': {'/CS0': IndirectObject(60270, 0, 139822345706256), '/CS1': IndirectObject(2558, 0, 139822345706256), '/CS2': IndirectObject(60271, 0, 139822345706256)}, '/ExtGState': {'/GS0': IndirectObject(2556, 0, 139822345706256)}, '/Font': {'/TT0': IndirectObject(60279, 0, 139822345706256), '/TT1': IndirectObject(2559, 0, 139822345706256), '/TT2': IndirectObject(60273, 0, 139822345706256)}, '/XObject': {'/Im0': IndirectObject(1465, 0, 139822345706256)}}, '/Rotate': 0, '/StructParents': 32, '/Tabs': '/S', '/Type': '/Page'}"

> ### On Document Types
> Looking at the data you are about to process is a very important habit to get into. In this case, it should be very obvious that we cannot work with the page objects directly, since their string form is just a dictionary of PDF internals (references to content streams, fonts, images, and so on) rather than readable text (we will see how to recover the text from the pages momentarily). *Always* make sure your data looks the way you think it does.
>
> We have a great deal to cover in this class, so we will not have the time to experiment with different types of source documents. You should definitely plan to spend some time experimenting on different source document types on your own. We will be dealing with PDF files exclusively in this class, but even there we can run into problems. For example, consider the DEV543.pdf file that we will use. If you load that file and view the pages you will find that almost all of the whitespace in the file is composed of tab characters! In this specific case, our processes will still handle the file with no big issues, but could it be more ideal to replace those tabs with spaces? While this sounds like a simple solution, could it be that there are locations in the document where the tabs are intentional and important?
>
> Another file type that you will likely use a great deal is HTML. While it is true that these files are just text files, it is usually a bad idea to try to import HTML files as-is. An exception would be if you are intending to do something with the generation of HTML code rather than textual content. When textual content is what you are after (which is our intent), using a library like BeautifulSoup to extract the text content for subsequent processing makes the most sense.
>
> There are even more obvious problems when we consider a source document, perhaps a PDF, that is composed of scanned images of pages rather than embedded text. In this case, a project like Tesseract can be quite useful for performing OCR at the page level across the entire document.
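
To illustrate the idea of pulling text out of HTML before embedding it, here is a minimal sketch. It deliberately swaps in Python's standard library `html.parser` rather than BeautifulSoup so that it runs without extra installs; the `TextExtractor` class, `html_to_text()` helper, and sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
page_text = html_to_text(sample_html)
print(page_text)
```

A dedicated library such as BeautifulSoup handles malformed markup and whitespace far more gracefully; this sketch only shows why the markup should be stripped before chunking.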

# <img src="../images/task.png" width=20 height=20> Task 3.8

From the output above we can see that this document has 492 pages. The page's internal PDF structure is also readable, though we cannot see any of the text that makes up the page. This is because the text lives in the content stream referenced by `/Contents`, which is typically compressed using the "Flate" (deflate, actually) algorithm. Don't worry, we don't need to manually decode anything. The `PdfReader` interface provides an `extract_text()` method for each page.

Use the next cell to output the result of `document.pages[10].extract_text()` and `document.pages[11].extract_text()`.

In [8]:
document.pages[10].extract_text(), document.pages[11].extract_text()

('NIST  SP 800- 53, REV. 5                                                                                     SECURITY AND PRIVACY CONTROLS FOR INFORMATION SYSTEMS AND ORGANIZATIONS                                                                  \n_________________________________________________________________________________________________  \nix \nThis publication is available free of charge from: https://doi.org/10.6028/NIST.SP.800 -53r5 \n   \nINFORMATION SYSTEMS —  A BROAD -BASED PERSPECTIVE  \nAs we push computers to “the edge ,” building an increasingly complex world of interconnected \nsystems and devices, security and privacy continue to do minate the national dialogue. There is \nan urgent need to further strengthen the underlying systems, products, and services that we \ndepend on in every sector of the critical infrastructure to ensur e that  those systems, products , \nand services are sufficiently trustworthy and provide the necessary resilience to support the \necono

# <img src="../images/task.png" width=20 height=20> Task 3.9

Look carefully at the output. Notice that the first thing output from each page is the string, "NIST  SP 800- 53, REV. 5 ... SECURITY AND PRIVACY CONTROLS FOR INFORMATION SYSTEMS AND ORGANIZATIONS". This appears to be the heading that appears as the first text element on every page in the PDF that has useful content (you can verify this for yourself if you open the PDF in a PDF viewer).

In this case, could it make sense to split the extracted text on newline characters and then drop the first element in the resulting list? Let's try this out.

In the following cell, iterate over the first 10 pages of the document, extracting the text. Split each page's text on newline characters, drop the first value (which will be everything left of the first newline) and print the resulting text.

In [9]:
for page_number, page in enumerate(document.pages[:10]):
    text = page.extract_text().split('\n')[1:]
    print(f'---------> Page {page_number + 1}')
    print(text)
    

---------> Page 1
['Revision 5  ', ' ', ' ', 'Security and Privacy Control s for ', 'Information  Systems  and Organizations  ', ' ', ' ', ' ', ' ', 'JOINT TASK FORCE  ', '  ', ' ', ' ', 'This publication is available free of charge from:  ', 'https://doi.org/10.6028/NIST.SP.800- 53r5    ', '  ', ' ', '  ', ' ', '  ']
---------> Page 2
['Revision 5  ', ' ', ' ', ' Security and Privacy Controls for ', 'Information Systems and Organizations                                                                                                 ', ' ', '  ', 'JOINT TASK FORCE  ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'This publication is available free of charge from:  ', 'https://doi.org/10.6028/NIST.SP.800- 53r5   ', ' ', ' ', ' ', ' ', 'September  2020  ', 'INCLUDES UPDATES AS OF 12- 10-2020; SEE PAGE XVII  ', ' ', ' ', '  ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'U.S. Department of Commerce  ', 'Wilbur L. Ross, Jr., Secretary  ', ' ', 'National Institute of Standards and Technology  ', '      Walter

# <img src="../images/task.png" width=20 height=20> Task 3.10

That looks pretty good, but it's not perfect. It appears that the first few pages are just front matter in the document, but when the content starts it appears that each page starts with a long horizontal bar (underscore characters). Let's take what we did in the last cell, adjust it to skip the header and the underscores, and turn that into a function.

Use the following cell to define a function named `get_text()` that will accept a page from a document, extract the text, split on newlines, and eliminate the heading and the underscores (the first two values in the resulting list), returning the remainder of the text. Test your function on a page in the document. Continue to adjust the number of lines that you skip so that extraneous page content is excluded.

In [10]:
def get_text(page, lines_to_skip = 4):
    """
    Return the text of a page with its leading boilerplate lines removed.

    Here's the logic of the one liner below:
        Extract the text (page.extract_text())
        Split the result on newlines (.split('\n'))
        Ignore the first lines_to_skip elements ([lines_to_skip:])
            On the content pages we looked at, the default of 4 skips the
            heading, the underscores, the page number, and the
            "This publication is available..." line.
        Join that list with newlines to create a single string ('\n'.join())
            Note that we are preserving all of the original newlines since they
            should tell us where paragraphs are. Semantically, we expect
            all of the sentences in a paragraph to be somewhat related
            and a new paragraph to indicate a change in thought.
    """
    return '\n'.join(page.extract_text().split('\n')[lines_to_skip:])

get_text(document.pages[200])

' mechanisms can  provide confidentiality and integrity protections depending on the mechanisms \nimplemented . Activities associated with media transport include releasing media for transport, \nensuring that media enters the appropriate transport proces ses, and  the actual tran sport . \nAuthorized transport and courier personnel may i nclude individuals external to the organization.  \nMaintaining accountability of media during transport includes  restricting transport activities to \nauthorized personnel and tracking and/or obtaining records of transport activities as the media \nmoves through the transportation system to prevent and detect loss, destruction, or tampering. \nOrganizations establish documentation requirements for activities associated with the transport \nof system media in acco rdance with organizational assessments of risk . Organizations maintain \nthe flexibility to define record -keeping methods for the different types of media transport as part \nof a system 
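
Skipping a fixed number of lines works when every page has the same layout. As a hedged alternative, the sketch below drops leading lines that match known boilerplate patterns instead. The patterns, the `strip_boilerplate()` helper, and the abbreviated `sample_page` are assumptions based on the page output shown earlier, and they would need tuning for other documents.

```python
import re

# Heuristic patterns for lines that carry no content. These are assumptions
# based on the page output shown earlier and may need tuning per document.
BOILERPLATE_PATTERNS = [
    re.compile(r'^NIST\s+SP\s+800'),                             # running page header
    re.compile(r'^_+\s*$'),                                      # rule made of underscores
    re.compile(r'^This publication is available free'),          # DOI notice
    re.compile(r'^(?:[ivxlcdm]+|\d+)\s*$', re.IGNORECASE),       # page number (roman or arabic)
]

def strip_boilerplate(text, header_window=6):
    """Drop boilerplate lines, but only within the first `header_window` lines."""
    lines = text.split('\n')
    kept = [
        line for index, line in enumerate(lines)
        if index >= header_window
        or not any(pattern.search(line) for pattern in BOILERPLATE_PATTERNS)
    ]
    return '\n'.join(kept)

# Abbreviated, hand-built sample that mimics the start of a content page.
sample_page = '\n'.join([
    'NIST  SP 800- 53, REV. 5      SECURITY AND PRIVACY CONTROLS FOR INFORMATION SYSTEMS AND ORGANIZATIONS   ',
    '_________________________________  ',
    'ix ',
    'This publication is available free of charge from: https://doi.org/10.6028/NIST.SP.800 -53r5 ',
    '   ',
    'INFORMATION SYSTEMS  ',
    'As we push computers to the edge, building an increasingly complex world of interconnected ',
])

cleaned = strip_boilerplate(sample_page)
print(cleaned)
```

Restricting the filter to the first few lines is a design choice that limits false positives: a body line that happens to look like a page number is left alone.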

# <img src="../images/task.png" width=20 height=20> Task 3.11

We could turn this text into embeddings as-is. It is certainly worth experimenting with using an entire page worth of content to generate the sentence embeddings, but my experience tells me this isn't going to be ideal. Another fairly common approach is to embed each paragraph. While this is better, in my experience, than embedding an entire page, we can generally do even better.

While there is no "right way" to chunk the text, think about what the goal of your project is. In this case, we would like to ask questions related to the material and get answers based on the material. If the embeddings are too general (page level embeddings), it is unlikely that any pages match a question very well, but many pages will have mediocre matches because they are tangentially connected. On the other hand, if we make the embeddings too granular, perhaps every five words on the page, we are now hyper-localized and will be unlikely to find general concepts or thoughts. In the end, you need to experiment with the particular documents and type of data those documents contain to find the best fit for your application.

Rather than committing to a single granularity, we're going to try to chunk things first as paragraphs, then as individual lines, and finally as words, falling back to the next smaller unit only when a piece is still too long. To do this, we are going to use the `RecursiveCharacterTextSplitter()`, which provides various options for customizing how the text is broken up. Its default separators (paragraph breaks, then newlines, then spaces) give exactly the approach we are looking for.

Instead of trying to describe it more, let's see what this approach does to a page of our text. In the following cell, define the following text splitter object and then use that to split the result of our `get_text()` function using `document.pages[50]` as the input.

```
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=25,
    length_function=len
)
```

In [11]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=25,
    length_function=len
)
splitter.split_text(get_text(document.pages[50]))


['Enforce [ Assignment: organization- defined mandatory access control policy ] over the set',
 'of covered subjects and objects specified in the policy,  and where the policy:',
 '(a) Is uniformly enforced across the covered subjects and objects within the system;',
 '(b) Specifies that a subject that has been granted access to information is constrained',
 'from doing any of the following;',
 '(1) Passing the information to unauthorized subjects or objects;',
 '(2) Granting its privileges to other subjects;',
 '(3) Changing one or more security attributes ( specified  by the policy) on subjects,',
 'objects, the system, or system components;',
 '(4) Choosing the security attributes and attribute values ( specified by the policy) to',
 'be ass ociated with newly created or modified objects; and',
 '(5) Changing the rules governing access control; and',
 '(c) Specifies that [ Assignment: organization- defined subjects ] may explicitly be granted',
 '[Assignment: organization- defined p

Take a few moments to look at the lines of text that are output and the configuration that we set up. Let's consider each option:

 * `chunk_size = 100` - This keyword argument allows us to control the maximum length (in characters) of any text string output by the text splitter. 100 is likely a bit small. It is worth experimenting with different chunk sizes to find a value that seems ideal for the specific text you are processing.
 * `chunk_overlap = 25` - This argument controls how many characters (maximum) each string will overlap the preceding and following strings. This allows us to tune how much information from each neighboring chunk is included. While information will cross over and appear in both chunks, our intention is to prevent an important idea from being cut in half, making it very difficult to find in the data later.
 * `length_function = len` - This argument allows us to pass in the function that we wish to use to determine the length of a chunk. By using the built-in `len()` function, we are determining the length based on the number of characters. We could write our own length function to define length in any way that we wish (perhaps the number of words).
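
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window. It is a simplified stand-in, not the algorithm that `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraph breaks and spaces, which is why its chunks vary in length). The function name `sliding_chunks` is our own.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk begins with the last four characters of the chunk before it. That shared region is what keeps an idea that straddles a boundary intact in at least one chunk.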

We do not yet know whether the length of the chunks works well for this set of data. We will conduct some experiments and determine a number that seems to work well.
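
As a hedged illustration of the `length_function` option described above, the sketch below measures length in words rather than characters. The name `word_count` is ours. Passing it as `length_function=word_count` should make `chunk_size` and `chunk_overlap` count words, assuming the splitter calls the function on a string and expects an integer back, as it does with `len`.

```python
def word_count(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

print(word_count("Enforce the mandatory access control policy"))  # prints 6
```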

# <img src="../images/task.png" width=20 height=20> Task 3.12

Use the cell below to create a `get_chunks()` function. This function should use the `splitter` to split text into chunks and return the list of chunks. Verify your function works.


In [12]:
def get_chunks(page, splitter):
    return splitter.split_text(get_text(page))

get_chunks(document.pages[40], splitter)

['potential loss of confidentiality of the information asset does not affect privacy, security',
 'objectives are the primary driver for the selection of the control. However, the implementation',
 'of the control with respect to monitoring for unauthorized access could involve the processing',
 'of PII which may result in privacy risks and affect privacy program objectives. The discussio n',
 'section in AU -3 includes privacy risk considerations so that organizations can take those',
 'considerations into account as they determine the best way to implement the control.',
 'Additionally, the control enhancement AU-3(3)  (Limit Personally Identifiable Information',
 'Elements) could be selected to support managing these privacy risks.',
 'Due to permutations in the relationship between information security and privacy program objectives',
 'program objectives  and risk management , there is a need for close collaboration between programs',
 'between programs to',
 'select and implement

# <img src="../images/task.png" width=20 height=20> Task 3.13

Let's take stock. Here's what we have so far:

 * A model that can translate text into embedding vectors.
 * An input PDF document, split into page objects.
 * A function that can take a page of text and return that page as overlapping chunks.

Here's what we still need to do:

 * Convert our chunks of text into embeddings.
 * Store the embeddings along with the original text into the vector database.
 * Perform a search in the vector database for things related to a question.
 * Feed the question, the search results, and a prompt to our LLM to generate a response.

It may seem as though there is more left to do than we have accomplished already, but what remains will go quite quickly. Let's start with the first outstanding bullet.

In the following cell, write a function that will accept a page object and return a list of tuples. In each tuple, the original text should be the first element and the embedding vector should be the second element.

**Note:** *If you are working on an ARM-based Mac, you may find the performance to be slow. This has to do with how the container is virtualized on the Mac. If you install all of the prerequisite libraries (mentioned in the lab setup instructions) natively and run these notebooks directly on your host, you will find the performance is more than acceptable. If you wish to go this route, you can continue to run the Milvus and Ollama containers while running the Jupyter notebooks from your host OS, which will allow the model to run natively as ARM code and allow the Mac GPU to be leveraged.*

In [13]:
def get_embeddings(page, model, splitter):
    results = []
    chunks = get_chunks(page, splitter)
    for chunk in chunks:
        results.append((chunk, model.encode(chunk)))
    return results

for (text, embedding) in get_embeddings(document.pages[100], model, splitter)[:10]: #Limit to the first 10 to keep the output short
    print(text)
    print(embedding[:5])

[Withdrawn: Moved to SC -45(2) .] 
References :  None . 
AU-9 PROTECTION OF AUDIT INFORMATION
[ 0.00897559  0.03029372 -0.00777875  0.05310851  0.0243885 ]
Control :
[-0.0787594   0.00766547 -0.00090127  0.04242965  0.02315489]
a. Protect audit information and audit logging tools from unauthorized access, modification,
[-0.00447991  0.06485244 -0.04550175  0.05918524  0.09146447]
and deletion ; and
[-0.02405787  0.06054065 -0.03553287  0.01448782  0.06969618]
b. Alert [ Assignment: organization- defined personnel or roles ] upon detection of unauthorized
[ 0.02074453 -0.00076108 -0.02485767  0.02495249  0.02422264]
access, modification, or deletion of audit information.
[-0.04841081  0.0552009  -0.03602372  0.04804078  0.07497172]
Discussion :  Audit information includes all information  needed to successfully audit system
[-0.04029271  0.07498271  0.00583133  0.10176983  0.07268541]
activity , such as  audit records, audit log settings, audit reports, and personally identifiable
[0.00
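
The loop in `get_embeddings()` encodes one chunk per call. Many embedding libraries also accept a list of strings and encode them as a batch, which is often faster. The sketch below assumes that `model.encode()` accepts a list and returns one vector per input; this is true of Sentence Transformers models but should be checked for whichever model you use. `FakeModel` is a hypothetical stand-in so that the example runs without loading a real model, and `pair_chunks_with_embeddings` is our own name.

```python
class FakeModel:
    """Hypothetical stand-in for an embedding model; returns tiny made-up vectors."""

    def encode(self, texts):
        # One 2-dimensional "vector" per input string: [length, number of spaces].
        return [[float(len(t)), float(t.count(" "))] for t in texts]


def pair_chunks_with_embeddings(chunks, model):
    """Encode all chunks in a single call and pair each chunk with its vector."""
    if not chunks:
        return []
    vectors = model.encode(chunks)  # assumed: list in, one vector per item out
    return list(zip(chunks, vectors))


pairs = pair_chunks_with_embeddings(["Control :", "and deletion ; and"], FakeModel())
for text, vector in pairs:
    print(text, vector)
```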

# <img src="../images/task.png" width=20 height=20> Task 3.14

Now that we are generating embeddings we need to get them stored in the vector database. For our labs, we are using Milvus, but please understand that any vector database will work in mostly the same way. Our goal is to create a vector store that can hold the source text chunks and the related vector for every page in the document.

In the following cell, we have already written the code that will establish the connection to our Milvus container. We aren't going to dig very deeply into the Milvus API and related code since it is really only presented as an example; we needed to talk to some vector database and Milvus was convenient. Ultimately, you will need to research the API for any vector store that you decide to use for your own projects. The code presented can act as a starting point should you decide to try out Milvus for production work.

Since we are leveraging containers, it's pretty simple to identify the host by name (`milvus-standalone`) rather than needing to work out what the dynamic IP address is. The Milvus service operates on port 19530 by default.

After connecting to the server, we are checking to see if the `SEC495` database already exists. If it does, we attach to that database. If it doesn't, we create the database.

Notice that there are several lines of code commented out. These lines can be uncommented if you wish to delete the database and start over fresh.


In [14]:
from pymilvus import MilvusClient

client = MilvusClient("http://milvus-standalone:19530")

# Uncomment the following lines if you wish to delete the SEC495 database in order to start over.
#client.using_database('SEC495')
#client.drop_collection('Lab_3')
#client.using_database('default')
#client.drop_database("SEC495")

if "SEC495" in client.list_databases():
    print("SEC495 already exists")
    client.using_database("SEC495")
else:
    print("Creating SEC495")
    client.create_database("SEC495")
    client.using_database("SEC495")


SEC495 already exists


# <img src="../images/task.png" width=20 height=20> Task 3.15

Now that we have a connection to the database that we want to use, we need to either connect to an existing collection (like a table in a SQL database) or create a new collection. This is code that we want to think about a bit more deeply.

When creating the collection, we have the opportunity to define the schema (design, structure) for that collection. This includes not only the size of the vectors that we will be storing but also any other "metadata," or other information, that we wish to store in the collection. For now we are going to use the Milvus API defaults, with the exception of the `auto_id` feature. If we do not set this kwarg to `True`, we will be responsible for generating IDs for our vectors. There is no good reason for us to be concerned with generating these IDs in our case.

Take a moment to consider the code in the following cell. When you are confident you have a feel for what it does, please execute the cell.

In [15]:
if client.has_collection(collection_name="Lab_3"):
    client.drop_collection(collection_name="Lab_3")
client.create_collection(
    collection_name="Lab_3",
    dimension=768,
    auto_id=True
)

# <img src="../images/task.png" width=20 height=20> Task 3.16

With the database and collection created, we can begin storing our vectors. While it is possible to perform these inserts one at a time, this approach is not particularly efficient. It is much better to insert *batches*, or groups, of records. To understand how to structure these, let's talk about how the `insert()` call works.

The client object that we have connected to the database has an `insert()` method. We can use this method to insert individual entries or batches of entries. There are two kwargs of particular interest for us:

 * `collection_name` must be passed, indicating the collection into which the data should be inserted.
 * `data` must be passed, providing the data that we wish to insert.

The `data` argument allows us to pass a list of objects to be inserted to the database driver. Each object in the list should be a dictionary with the following two keys:

 * `vector`: The value of this key must be a vector matching the size configured when the collection was created (768, in our case).
 * `text`: The value of this key will be the source text used to generate the associated vector.
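
The record structure described above can be sketched without a database connection. The helper names `build_rows` and `batched` are our own, and the two-element vectors are placeholders for the 768-dimensional embeddings used in the lab. Splitting the rows into fixed-size batches is one possible way to keep any single `insert()` call from growing too large; inserting one batch per page works as well.

```python
def build_rows(pairs):
    """Turn (text, vector) tuples into dictionaries with 'vector' and 'text' keys."""
    return [{"vector": vector, "text": text} for text, vector in pairs]


def batched(rows, batch_size):
    """Yield successive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


pairs = [("chunk one", [0.1, 0.2]), ("chunk two", [0.3, 0.4]), ("chunk three", [0.5, 0.6])]
for batch in batched(build_rows(pairs), batch_size=2):
    # In the lab, client.insert(collection_name="Lab_3", data=batch) would go here.
    print(len(batch), [row["text"] for row in batch])
```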

Using the following cell, write a function that accepts a document, the embedding model, and the text splitter. The function should iterate over all of the pages in the document, building a list of dictionary objects (as described above) for each page. Use the `client.insert()` function (as described above) to insert the list of objects from each page.

**Note:** *You may wish to include some sort of status in your iteration loop since there are more than 490 pages and this task will take some time to complete.*

In [16]:
def store_embeddings(document, model, splitter):
    for page_num, page in enumerate(document.pages):
        if (page_num + 1) % 10 == 0:
            print(f'Page {page_num+1}')
        data = [
            {"vector": vector, "text":text} for text,vector in get_embeddings(page, model, splitter)
        ]
        client.insert(collection_name="Lab_3", data=data)

store_embeddings(document, model, splitter)

Page 10
Page 20
Page 30
Page 40
Page 50
Page 60
Page 70
Page 80
Page 90
Page 100
Page 110
Page 120
Page 130
Page 140
Page 150
Page 160
Page 170
Page 180
Page 190
Page 200
Page 210
Page 220
Page 230
Page 240
Page 250
Page 260
Page 270
Page 280
Page 290
Page 300
Page 310
Page 320
Page 330
Page 340
Page 350
Page 360
Page 370
Page 380
Page 390
Page 400
Page 410
Page 420
Page 430
Page 440
Page 450
Page 460
Page 470
Page 480
Page 490


# <img src="../images/task.png" width=20 height=20> Task 3.17

Now that the data has been stored we can begin to perform vector searches to find text chunks that are related to a question we might pose. The `MilvusClient` object that we have connected to the Milvus container will allow us to execute queries through a `search()` method.

To use the `search()` method, we need to pass in several arguments:

 * `collection_name` must contain the name of the collection in the database that you wish to query. For us, this is `Lab_3`.
 * `data` must contain a list of vectors that you wish to match. The vectors *must* be produced by the same model that was used to encode the vectors stored in the database or the results will make no sense. We can generate a vector by calling `model.encode()` with our question. The result is a single vector, so we need to wrap it in square brackets (`[vector]`) to make it a list of vectors; in this case, a list of one vector. Calling `list()` on the vector would not do this; it would produce a flat list of the vector's individual values.
 * `limit` is an optional kwarg that allows us to limit the number of results from the search. This allows you to choose how many of the closest matches should be returned.
 * `output_fields` is an important kwarg for us. It names the stored fields that should be returned with each match, in addition to the ID and distance. For our purposes, we do not need the vector; we only need the `text` field.
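
The difference between a single vector and a list of vectors is easy to get wrong. The sketch below uses a short tuple as a stand-in for the array returned by `model.encode()`; the values are made up.

```python
vector = (0.12, -0.03, 0.45)  # stand-in for one embedding vector

as_batch = [vector]     # a list containing one vector: the shape search() needs
as_flat = list(vector)  # a list of three floats: NOT a list of vectors

print(len(as_batch))  # 1
print(len(as_flat))   # 3
```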

Use the following cell to call `client.search()`. Query our `Lab_3` collection. Ask the question, "What is information security?". Use a `limit` of 5, and be sure to return only the `text` field in the output from the search.

In [17]:
client.search(collection_name="Lab_3", data=[model.encode("What is information security?")], limit=5, output_fields=['text'])

data: ["[{'id': 454178687287862215, 'distance': 0.7822209000587463, 'entity': {'text': 'information security,'}}, {'id': 454178687287862153, 'distance': 0.6972179412841797, 'entity': {'text': 'of the information security and'}}, {'id': 454178687287863743, 'distance': 0.653671145439148, 'entity': {'text': 'information on security'}}, {'id': 454178687287868247, 'distance': 0.6533844470977783, 'entity': {'text': 'information security - or privacy -related purpose.'}}, {'id': 454178687287851986, 'distance': 0.6204501986503601, 'entity': {'text': '3 The two terms information security  and security  are used synonymously  in this publication.'}}]"] 

# <img src="../images/task.png" width=20 height=20> Task 3.18

Looking at the returned data from `client.search()`, we can see a great deal of interesting information. Specifically, we can see the `distance` value for each chunk that has been returned. We could potentially create a threshold that could be used to filter these results so that only chunks of text meeting or exceeding that threshold could be used to build a response. We also have the `entity` field, which contains the original chunk attached to the key `text`.
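
A minimal sketch of such a threshold filter follows. The `hits` list copies the shape and values of the search output shown above for a single query. The cutoff of 0.65 and the name `filter_hits` are arbitrary choices for illustration. The sketch assumes that a larger `distance` means a closer match, which is consistent with the ordering of the results above but depends on the metric the collection was created with.

```python
hits = [
    {"id": 454178687287862215, "distance": 0.7822209000587463,
     "entity": {"text": "information security,"}},
    {"id": 454178687287862153, "distance": 0.6972179412841797,
     "entity": {"text": "of the information security and"}},
    {"id": 454178687287863743, "distance": 0.653671145439148,
     "entity": {"text": "information on security"}},
    {"id": 454178687287868247, "distance": 0.6533844470977783,
     "entity": {"text": "information security - or privacy -related purpose."}},
    {"id": 454178687287851986, "distance": 0.6204501986503601,
     "entity": {"text": "3 The two terms information security  and security  are used synonymously  in this publication."}},
]


def filter_hits(hits, threshold):
    """Keep the text of every hit whose score meets or exceeds threshold."""
    return [hit["entity"]["text"] for hit in hits if hit["distance"] >= threshold]


for text in filter_hits(hits, threshold=0.65):
    print(text)
```

With this cutoff the first four chunks are kept and the fifth is dropped. A suitable value for real use would need to be found by experiment.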

You can see in the cell below that we have included the `get_stream()` function that we saw in the last lab. This is the function that we used to stream the results from our Ollama container. We have also included a `query_RAG()` function that demonstrates a simple question answering prompt leveraging the text chunks returned from the vector search.

Use the following cell to experiment with the data returned until you are able to extract the text chunks from the result. Once you are able to isolate the texts, capture these in a list.

After you are able to build the list of chunks, pass the original question and the list of chunks to the `query_RAG()` function and see how well it performs.

In [18]:
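import json
import requests

def get_stream(url, data):
    session = requests.Session()

    # The response is streamed as one JSON object per line; print each token as it arrives.
    with session.post(url, data=data, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                token = json.loads(line)["response"]
                print(token, end='')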

def query_RAG(question, chunks):
    chunks = '\n'.join(chunks)
    prompt = f"""
        Answer the following question using only the datasource provided. Be concise. Do not guess. 
        If you cannot answer the question from the datasource, tell the user the information they want is not
        in your dataset. Refer to the datasource as 'my sources' any time you might use the word 'datasource'.

        question: <{question}>

        datasource: <{chunks}>
        """
    data = {"model":"llama3", "prompt": prompt, "stream":True}
    url = 'http://ollama:11434/api/generate'
    get_stream(url, json.dumps(data))

question = "What is information security?"
result = client.search(collection_name="Lab_3", data=[model.encode(question)], limit=5, output_fields=['text'])
chunks = [i['entity']['text'] for i in result[0]]
query_RAG(question, chunks)

According to my sources, information security refers to "the information security and... information on security". It is also mentioned that the term "information security" is used synonymously with "security", implying that they refer to the same concept.
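
When tuning a prompt it can help to look at exactly what is sent to the model. The sketch below separates prompt construction from the network call so that the prompt can be printed and inspected. The function name `build_prompt` is ours, and `textwrap.dedent` is used only to strip the leading indentation; the wording of the instructions is copied from `query_RAG()`.

```python
import textwrap


def build_prompt(question, chunks):
    """Return the RAG prompt text for a question and a list of text chunks."""
    datasource = "\n".join(chunks)
    # Dedent the template before filling it in so chunk text cannot affect the indentation.
    template = textwrap.dedent("""\
        Answer the following question using only the datasource provided. Be concise. Do not guess.
        If you cannot answer the question from the datasource, tell the user the information they want is not
        in your dataset. Refer to the datasource as 'my sources' any time you might use the word 'datasource'.

        question: <{question}>

        datasource: <{datasource}>
        """)
    return template.format(question=question, datasource=datasource)


print(build_prompt("What is information security?",
                   ["information security,", "information on security"]))
```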

# <img src="../images/task.png" width=20 height=20> Task 3.19

The results, so far, are not amazing. There are a variety of answers that the model might choose to generate from the source documents. One of the least impressive is:

> Information security is defined as... itself.

If you look at the raw chunks that were returned you can likely surmise why the results are not amazing; each chunk is pretty small! Let's see if we can improve this simply by reconfiguring the text splitter.
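
One quick way to confirm that the chunks are small is to measure them. The sketch below uses the five texts from the earlier search output; `returned_texts` is a name chosen for this example.

```python
returned_texts = [
    "information security,",
    "of the information security and",
    "information on security",
    "information security - or privacy -related purpose.",
    "3 The two terms information security  and security  are used synonymously  in this publication.",
]

lengths = [len(text) for text in returned_texts]
print(lengths)
print(f"shortest: {min(lengths)}, longest: {max(lengths)}, "
      f"mean: {sum(lengths) / len(lengths):.1f}")
```

Several of these chunks are only a few words long, which gives the model very little to work with.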

In the following cell:

 * Define a new `RecursiveCharacterTextSplitter()` that uses a `chunk_size` of 400 and a `chunk_overlap` of 75.
 * Use `client.drop_collection()` to drop the `Lab_3` collection.
 * Recreate the `Lab_3` collection using `client.create_collection()`; feel free to refer to the sample code above as needed.
 * Reprocess the document using `store_embeddings()`, passing in the model and the new splitter.

In [19]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=75,
    length_function=len
)

client.drop_collection(collection_name="Lab_3")
client.create_collection(
    collection_name="Lab_3",
    dimension=768,
    auto_id=True
)

store_embeddings(document, model, splitter)


Page 10
Page 20
Page 30
Page 40
Page 50
Page 60
Page 70
Page 80
Page 90
Page 100
Page 110
Page 120
Page 130
Page 140
Page 150
Page 160
Page 170
Page 180
Page 190
Page 200
Page 210
Page 220
Page 230
Page 240
Page 250
Page 260
Page 270
Page 280
Page 290
Page 300
Page 310
Page 320
Page 330
Page 340
Page 350
Page 360
Page 370
Page 380
Page 390
Page 400
Page 410
Page 420
Page 430
Page 440
Page 450
Page 460
Page 470
Page 480
Page 490


# <img src="../images/task.png" width=20 height=20> Task 3.20

Using the next cell, try the question, "What is information security?" again. This time, ask Milvus to return the top 10 results using the `limit` argument. Build the list of chunks that are returned and then call `query_RAG()` with the question and the chunks.

In [20]:
question = "What is information security?"
result = client.search(collection_name="Lab_3", data=[model.encode(question)], limit=10, output_fields=['text'])
chunks = [i['entity']['text'] for i in result[0]]
query_RAG(question, chunks)

According to my sources, information security refers to the protection of information and systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide confidentiality, integrity, and availability.

# <img src="../images/task.png" width=20 height=20> Task 3.21

Clearly, this is *much* better. There are more features we need to add to this to improve the results and defend our model a bit, but it's worth taking a moment to point out that we are not limited to "questions."

Using the following cell, bundle up the code you have created to perform the vector search, extract the chunks of text from the vector search results, and call `query_RAG()` into a function named `answer_question()`. Send the model the following:

`Provide five bullets detailing the most important password policy requirements.`

In [21]:
def answer_question(question):
    result = client.search(collection_name="Lab_3", data=[model.encode(question)], limit=10, output_fields=['text'])
    context = [i['entity']['text'] for i in result[0]]
    query_RAG(question, context)

question = "Provide five bullets detailing the most important password policy requirements."
answer_question(question)

Based on my sources, here are five bullets detailing the most important password policy requirements:

• Allow users to select long passwords and passphrases that include spaces and all printable characters.
• Enforce organization-defined composition and complexity rules for password-based authentication (IA-5(1)(h)).
• Verify that passwords created or updated by users are not found on lists of commonly used, expected, or compromised passwords (b).
• Transmit passwords only over cryptographically protected channels (c), and store them using an approved salted key derivation function (d).
• Require immediate selection of a new password upon account recovery (e).

Note: These requirements are based solely on the provided text and do not include any additional information that may be necessary for comprehensive password management.

# Conclusion

Let's reflect a bit on what we've discussed in this lab. The big takeaway is that our textual data must be converted into numbers through some means. The most common approach to this problem, especially when it comes to LLMs, is to tokenize the text into chunks and then learn embedding vector representations of these tokens, in the expectation that the various dimensions in the vectors will come to hold meaningful information about the semantic and contextual meaning of each word. We also appreciate that the network really does not know what the word is or what it means; it simply knows what the vector representation of that word is.

To examine these ideas we looked at two main ways of training embeddings: attempting to predict the next word given the current input word, and skipgrams. We also know that the implementations of these approaches in this lab are not the only ways to implement these particular embedding training networks.

With these understandings, we are now ready to start looking at how to preprocess our existing text data in such a way as to make it useful for various AI techniques.