<a href="https://colab.research.google.com/github/shahriar-faghani/ASNR_ASFNR_AI_Workshop_2024/blob/main/LLMs_ASNR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Large Language Models: Zero-Shot Learning, Few-Shot Learning, and RAG**

---

Radiology Informatics Lab, Department of Radiology, Mayo Clinic (MN):

<b>
Shahriar Faghani, MD
</b>

---

In recent years, the applications of large language models (LLMs) like GPT-4 have expanded rapidly. However, like all tools, LLMs come with limitations. One prominent challenge is **"hallucination"**, where the model generates information that is incorrect or not supported by its training data. In fields like medicine, such errors could lead to misleading interpretations and, in worst-case scenarios, detrimental patient outcomes.

In this notebook we will learn about **Retrieval Augmented Generation (RAG)**, an approach that may help mitigate hallucination errors in LLMs. This approach combines the generative capabilities of LLMs with the accuracy of retrieval-based models. In RAG, when a query is made, the model first fetches relevant documents or data snippets (retrieval phase) from a large pool of documents (either already available or provided by the user) and then uses this information to generate a response (generation phase). By combining the strengths of both retrieval and generation models, RAG aims to provide more accurate and contextually relevant answers.

## **Part 0: Setting the scene**

### Setting Up the Environment

Before diving into Retrieval Augmented Generation (RAG), we need to set up our environment by installing the necessary libraries. The libraries listed here provide us with tools and functionalities to implement and leverage RAG, as well as other related processes. Here's a brief overview of some of the core libraries:

*   **transformers**: Contains implementations of many state-of-the-art models, including those related to RAG.
*   **sentence-transformers**: Helps in creating embeddings for sentences, useful for the retrieval phase in RAG.
*   **chromadb**: An open-source vector database for storing embeddings and retrieving them by similarity.
*   **accelerate**: Handles device placement when loading and running large models across GPUs and CPUs.
*   **einops** and **xformers**: Offer advanced operations and architectures for neural networks.
*   **bitsandbytes**: Provides the low-bit quantization used to load large models with less GPU memory.
*   **pypdf** and **pymupdf**: Assist in parsing the PDF files.

After installing these, we can import the necessary modules to prepare for our subsequent RAG experiments.

----
> **Note**: You do not need an OpenAI API key to run this notebook. In the later cells, we will run a few tasks using the OpenAI models, but the outputs of those cells are precomputed and already available to you. Loading Llama 3 does require a Hugging Face access token, as described in the "Special setup for Llama 3" section.
---

In [None]:
# Install new packages
!pip install -qU --no-warn-conflicts \
  transformers==4.40.2 \
  sentence-transformers==2.7.0 \
  accelerate==0.30.1 \
  einops==0.8.0 \
  xformers==0.0.26.post1 \
  bitsandbytes==0.43.1 \
  chromadb==0.5.0 \
  pypdf==4.2.0 \
  pymupdf==1.24.4 \
  torch==2.3.0


### Environment Configuration

In this section, we're setting up some preliminary configurations to ensure our experiments run seamlessly.

LLMs process inputs and outputs in chunks called tokens. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...

Our destiny is written in the stars.

...is tokenized into ["Our", " destiny", " is", " written", " in", " the", " stars", "."] for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.

In [None]:
import os
import random
import shutil
import warnings
import numpy as np
import pandas as pd

# Configure the display options
warnings.filterwarnings('ignore')
pd.options.display.max_colwidth = 1000

# Remove the sample_data directory by Google Colab
if os.path.exists('sample_data'):
  shutil.rmtree('sample_data')

# Set the seed for the random libraries
random.seed(42)
np.random.seed(42)

### Data Acquisition and Preparation

To explore and validate our RAG model, we use a dataset comprising 4 open-access articles from the *American Journal of Neuroradiology* (*AJNR*).

In the cells below, we will download these PDF files, parse them page by page, and explore them a little bit...

In [None]:
# Download the file from GitHub
!wget -q -O ASNR_ASFNR_AI_Workshop.zip https://github.com/shahriar-faghani/ASNR_ASFNR_AI_Workshop_2024/raw/main/ASNR_ASFNR_AI_Workshop.zip

# Unzip the file to extract only the needed structure
!unzip -q ASNR_ASFNR_AI_Workshop.zip -d .

# Remove the zip file now that its contents are extracted
!rm ASNR_ASFNR_AI_Workshop.zip

In [None]:
root = '/content/ASNR_ASFNR_AI_Workshop'
articles_dir = os.path.join(root, 'Articles')

In [None]:
# Load the PDF files for all articles and parse them page by page
from pypdf import PdfReader
from tqdm.auto import tqdm

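# Keep only PDF files in case the directory contains anything else
pdf_paths = [
    os.path.join(articles_dir, file)
    for file in os.listdir(articles_dir)
    if file.lower().endswith('.pdf')
]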
pdf_docs = list()
for pdf_path in tqdm(pdf_paths, total=len(pdf_paths)):
  reader = PdfReader(pdf_path)
  for i, page in enumerate(reader.pages):
    page_content = page.extract_text()
    pdf_docs.append({
        "source": pdf_path,
        "page": i,
        "page_content": page_content
    })

In [None]:
# Some texts may contain illegal chars that may confuse the downstream LLMs or
# cause trouble when saving the text to disk. Let's remove them.

def remove_illegal_chars(text):
    illegal_chars = [
        '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06',
        '\x07', '\x08', '\x0b', '\x0c', '\x0e', '\x0f', '\x10',
        '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
        '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e',
        '\x1f'
    ]
    for char in illegal_chars:
        text = text.replace(char, '')
    return text

for pdf_doc in pdf_docs:
  pdf_doc['page_content'] = remove_illegal_chars(pdf_doc['page_content'])

In [None]:
# Investigate the loaded PDF files

print(f'Number of documents: {len(pdf_docs)}')

# Show the loaded pages as a dataframe

df = pd.DataFrame(pdf_docs)
df.head()

### Regular question-answering with Open Source LLMs

Before diving into Retrieval Augmented Generation (RAG), it's crucial to understand the performance of traditional Large Language Models (LLMs) without retrieval augmentation. For this purpose, we're setting up a baseline using Llama 3, a state-of-the-art open-source language model.

The provided code performs the following tasks:

1.   Specifies the model_id corresponding to Llama 3 available on the HuggingFace Model Hub.
2.   Determines the computational device (GPU or CPU) for running the model.
3.   Configures quantization settings via BitsAndBytesConfig to load the model using reduced memory. Quantization is a technique to store and compute on model parameters using fewer bits, which can be particularly useful when working with large models on limited hardware.
4.   Initializes the model configuration and the model itself using the provided model_id.
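
To build intuition for step 3, here is a minimal sketch of quantization in plain Python. It uses simple symmetric "absmax" rounding to 4-bit integers, which is an assumption made for illustration only: the NF4 scheme selected through `BitsAndBytesConfig` uses non-uniform levels and block-wise scaling. The function names `quantize_absmax` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_int = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / max_int if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [0.12, -0.53, 0.98, -0.07, 0.31]
q_weights, scale = quantize_absmax(weights)
restored = dequantize(q_weights, scale)

print("4-bit integers:", q_weights)
print("restored      :", [round(w, 3) for w in restored])
print("max abs error :", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored as a small integer plus one shared scale, so memory drops sharply at the cost of a rounding error of at most half the scale per weight.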

### Special setup for Llama 3

You need to request access to the model [here](https://llama.meta.com/llama-downloads/). Then create an access token in your Hugging Face account and provide it in the cell below.

In [None]:
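import os
from huggingface_hub import login

# Never hard-code an access token in a shared notebook. Read it from the
# HF_TOKEN environment variable; if it is not set, login() prompts for it.
login(token=os.environ.get("HF_TOKEN"))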

In [None]:
# Loading the model from HuggingFace

from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
# model_id = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(model_id)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()
print(f"Model loaded on {device}")

A crucial step in working with language models is to convert the textual input into a format that the model can understand. This process is known as **tokenization**. Essentially, tokenization breaks down text into smaller pieces, commonly called tokens. These tokens are then mapped to unique integers, allowing them to be processed by the model. Please refer to this [tutorial](https://medium.com/@fhirfly/understanding-tokens-in-the-context-of-large-language-models-like-bert-and-t5-8aa0db90ef39) to learn more about tokenization.
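
As a rough illustration of the two steps described above (splitting text into tokens, then mapping tokens to integers), here is a toy sketch in plain Python. The regular-expression splitter and the tiny vocabulary are assumptions for demonstration only; real tokenizers such as the one used by Llama 3 rely on a learned subword vocabulary, so their token IDs will differ.

```python
import re


def toy_tokenize(text):
    # Split into words (keeping the leading space) and single punctuation marks
    return re.findall(r" ?\w+|[^\w\s]", text)


sentence = "Our destiny is written in the stars."
tokens = toy_tokenize(sentence)

# Build a toy vocabulary: each distinct token gets a unique integer ID
vocab = {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[token] for token in tokens]

# Decoding is the reverse lookup followed by concatenation
id_to_token = {idx: token for token, idx in vocab.items()}
decoded = "".join(id_to_token[i] for i in token_ids)

print(tokens)
print(token_ids)
print(decoded)
```

The model itself only ever sees the integer IDs; the text is recovered at the end by decoding the generated IDs back into tokens.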

Let's also load a tokenizer from HuggingFace. We need to pass the `model_id` so that we load the appropriate tokenizer for our model.

In [None]:
# Setup a tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

In the next cell, we will define a function that you can use for question-answering with HuggingFace models, including the Llama 3 instruct model we defined earlier.

In [None]:
def qa_with_hf_llms(
    messages,
    model,
    tokenizer,
    temperature=0.5,  # 'randomness' of outputs; must be > 0 when sampling, lower is more deterministic
    top_p=0.9,  # breadth of generated outputs (nucleus sampling threshold)
    max_tokens=2000,  # max number of tokens to generate in the output
):
    # Llama 3 ends a turn with either the EOS token or the <|eot_id|> token
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    # Format the messages with the model's chat template and tokenize them
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    # Infer
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_tokens,
        eos_token_id=terminators,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )

    # Keep only the newly generated tokens and decode them to text
    response = outputs[0][input_ids.shape[-1]:]
    response = tokenizer.decode(response, skip_special_tokens=True)
    return response
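
The `temperature` and `top_p` arguments control how the next token is sampled. The following is a conceptual sketch in plain Python, not the actual implementation inside `model.generate`; the helper names and the four example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by a small temperature sharpens the distribution
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p, then renormalize
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for t in (0.5, 1.0):
    probs = softmax_with_temperature(logits, temperature=t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])

candidates = top_p_filter(softmax_with_temperature(logits, 0.5), top_p=0.9)
rng = random.Random(42)
next_token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
print("candidates after top-p:", sorted(candidates), "sampled:", next_token)
```

Lower temperatures concentrate probability on the top-scoring token, while a lower `top_p` removes more of the unlikely tail before sampling.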

If you want to use other models like Mixtral, comment out the cell above and uncomment the cell below. Note that this version takes a plain-text `prompt` rather than a list of messages.

In [None]:
# @title
# def qa_with_hf_llms(
#     prompt,
#     model,
#     tokenizer,
#     temperature=0.1,  # 'randomness' of outputs
#     max_tokens=2000,  # max number of tokens to generate in the output
#     repetition_penalty=1.1  # without this output begins repeating
# ):

#   # Build a HuggingFace generator on top of the HuggingFace model
#   generator = transformers.pipeline(
#       model=model,
#       tokenizer=tokenizer,
#       return_full_text=False,
#       task='text-generation',
#       temperature=temperature,
#       max_new_tokens=max_tokens,
#       repetition_penalty=repetition_penalty
#   )

#   # Necessary to enable batching for inference
#   generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

#   # Infer
#   res = generator(prompt, pad_token_id=tokenizer.eos_token_id)
#   return res[0]["generated_text"].strip()

### Report analysis:
Let's start by analyzing this synthetic report!

In [None]:
messages = [
    {"role": "system", "content": "Provided with this report, determine if there is any metastasis. If yes, specify the vertebral level. Reason through each step and return the findings in a JSON format."},
    {"role": "user", "content": "Technique: MRI of the thoracic and lumbar spine was performed using T1-weighted, T2-weighted, and STIR sequences in sagittal and axial planes. Findings:Alignment: Normal vertebral alignment is maintained. Bone Marrow Signal: There are areas in the L1 and L3 vertebral bodies that demonstrate altered signal characteristics, which might suggest metastatic involvement. These areas appear hypointense on T1-weighted images and hyperintense on STIR sequences. Disc Spaces: Intervertebral disc spaces are preserved with no significant disc herniation or bulging noted. Spinal Canal and Neural Foraminal: The spinal canal is of normal caliber with no evidence of significant stenosis. Neural foramina are patent bilaterally throughout the visualized levels. Cord and Conus: The spinal cord and conus medullaris demonstrate normal signal intensity with no focal lesions identified. Impression: Altered signal in L1 and L3 vertebral bodies: These findings could be suggestive of metastatic involvement. However, correlation with the patient's clinical history and additional imaging studies or biopsy may be warranted for further evaluation. No significant spinal canal or foraminal stenosis.Recommendations: Further evaluation with contrast-enhanced MRI or PET-CT may be considered to better characterize these findings. Clinical correlation and possibly a biopsy of the suspicious areas may be needed to confirm the presence of metastasis."},
]

llm_response = qa_with_hf_llms(messages, model, tokenizer)

print(f'Prompt:\n```{messages}```')
print(f'\nLLM response:\n```{llm_response}```')

Now, let's ask a random question related to neuroradiology and observe its response. Given that we are utilizing the instruct model to achieve optimal performance, it's essential that we adhere to the 'role':'content' format for our messages.
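
To make the 'role':'content' format concrete, the sketch below renders a list of messages into a single prompt string in the style of the Llama 3 chat format. It is a hand-written approximation of what `tokenizer.apply_chat_template` produces, shown only for intuition; the function name `render_llama3_style_prompt` and the example messages are hypothetical, and the real template shipped with the tokenizer should always be used in practice.

```python
def render_llama3_style_prompt(messages, add_generation_prompt=True):
    # Each message becomes a header naming the role, the content, and an end-of-turn marker
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to write
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example_messages = [
    {"role": "system", "content": "You are an expert neuroradiologist."},
    {"role": "user", "content": "Where are VNS devices usually implanted?"},
]

rendered = render_llama3_style_prompt(example_messages)
print(rendered)
```

The special markers are why the generation function stops on `<|eot_id|>`: the model emits that token when it finishes its own turn.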

In [None]:
# Simple LLM inference with HuggingFace

question = "Where are VNS devices usually implanted?"

messages = [
    {"role": "system", "content": "You are an expert neuroradiologist, and you will answer some questions regarding Compatibility of Standard Vagus Nerve Stimulation and Investigational Microburst Vagus Nerve Stimulation Therapy with fMRI"},
    {"role": "user", "content": question},
]

llm_response = qa_with_hf_llms(messages, model, tokenizer)

print(f'Prompt:\n```{messages}```')
print(f'\nLLM response:\n```{llm_response}```')

As expected, the model gave a plausible answer. Note, however, that this answer comes only from the general knowledge the model acquired during pretraining; Llama 3 does not have any access to our documents yet. Still, the answer is promising so far...

## **Part 1: Retrieval Augmented Generation (RAG)**

### Simplified Overview of RAG

The RAG framework offers a blend of traditional large language models and external knowledge retrieval, making it especially beneficial for specialized tasks. Let's simplify the process with a general overview without diving into a specific domain.

Imagine a vast digital library filled with books on a variety of subjects. Now, think of the RAG system as a librarian with an impeccable memory. When you ask this librarian a question, rather than relying solely on memory, they look up relevant information from the library to provide a comprehensive and precise answer.

Here’s a step-by-step breakdown of the process:

1.  **Embedding Knowledge**: Initially, the RAG system scans all the books (or documents) in the library, understanding their content, and converting each page into a digital fingerprint, or "vector". These vectors are stored in a special digital catalog.

2.  **Question Analysis**: Now, when you ask the librarian a question, they instantly translate your question into a similar digital fingerprint to know what to look for in the catalog.

3.  **Finding Relevant Information**: Using the fingerprint of your question, the librarian quickly searches the catalog to find the pages (or chunks of data) most closely related to your query, as if comparing the similarities between the patterns of two fingerprints.

4.  **Crafting the Response**: With the relevant pages in hand, the librarian now composes a well-informed answer, ensuring it's based on the information from the library. This answer is not just from memory but is augmented by the recent information they retrieved.

At the heart of this process is the digital catalog (vector database). It helps the RAG system ground its answers in the information it has been given, which improves accuracy and relevance. This approach is particularly beneficial for scenarios where a system needs to tap into specific, up-to-date, or domain-relevant data to answer queries effectively.

The following figure simplifies the above methodology for question answering with LLMs using the RAG methodology:
<img src="https://i.ibb.co/5GchbqR/RAG.jpg" alt="RAG" border="0">
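
The four steps above can be sketched end to end in a few lines of plain Python. This is a toy illustration under loud assumptions: word counts stand in for neural embeddings, a Python list stands in for the vector database, the three "library" sentences are made up, and the sketch stops at building the prompt instead of calling a language model.

```python
import math
import re
from collections import Counter

# A made-up three-sentence "library", for illustration only
library = [
    "The library opens at nine in the morning on weekdays.",
    "Books can be borrowed for three weeks at a time.",
    "The reading room is on the second floor.",
]


def embed(text):
    # Toy "fingerprint": word counts (a real system uses a neural embedding model)
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 1: embed every document once and store the vectors (the "catalog")
catalog = [(doc, embed(doc)) for doc in library]


def retrieve(question, k=1):
    # Steps 2 and 3: embed the question and rank the catalog by similarity
    q_vec = embed(question)
    ranked = sorted(catalog, key=lambda item: cosine_similarity(q_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question, k=1):
    # Step 4: hand the retrieved text to the language model as context
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("When does the library open?"))
```

The rest of this notebook follows the same outline, replacing each toy component with a real one: a sentence-transformer for `embed`, ChromaDB for the catalog, and Llama 3 for the final answer.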

### Chunking the texts

To facilitate efficient document retrieval, especially when dealing with large text files, it's often beneficial to divide these documents into manageable "**chunks**". This allows for faster indexing, storage, and retrieval, which is paramount in real-time applications like RAG.

In the next cell, you will find a function that receives a text document and returns a list of chunks of that document. As you see, it splits the text into chunks of a certain size with some overlap between consecutive chunks.

> **Question**: Why do we need to leave some overlap between the chunks we are generating?

In [None]:
# Split the PDF pages into chunks

def chunk_text(text, chunk_size=1500, chunk_overlap=200):
    chunked_docs = []
    i = 0
    while i < len(text):
        # Determine the end of the current chunk, considering the document's length
        end_index = min(i + chunk_size, len(text))
        chunk = text[i:end_index]
        chunked_docs.append('...'+chunk+'...')

        # Stop once a chunk reaches the end of the text, so that no text is
        # dropped and no redundant chunk is created
        if end_index == len(text):
            break

        # Advance i to start the next chunk, accounting for overlap
        i += chunk_size - chunk_overlap

    return chunked_docs

chunked_docs = list()
for pdf_doc in pdf_docs:
  for i, chunk in enumerate(chunk_text(pdf_doc['page_content'])):
    chunked_docs.append({
        "source": pdf_doc['source'],
        "page": pdf_doc['page'],
        "chunk": chunk,
        "chunk_index": i
    })

print(f'Number of chunks: {len(chunked_docs)}')
print(f'One sample chunk: {chunked_docs[100]["chunk"]}')
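
One way to explore the overlap question is to run the same sliding-window idea on a short sentence with tiny chunk sizes. The sketch below is a simplified, self-contained variant (the name `sliding_chunks` and the small sizes are assumptions chosen for readability), so the shared characters between neighboring chunks are easy to see.

```python
def sliding_chunks(text, chunk_size=15, chunk_overlap=5):
    chunks = []
    i = 0
    while i < len(text):
        end_index = min(i + chunk_size, len(text))
        chunks.append(text[i:end_index])
        if end_index == len(text):
            break
        # The next chunk starts chunk_overlap characters before this one ends
        i += chunk_size - chunk_overlap
    return chunks


text = "Our destiny is written in the stars."
chunks = sliding_chunks(text)

for index, chunk in enumerate(chunks):
    print(index, repr(chunk))

# Neighboring chunks share their boundary characters
for left, right in zip(chunks, chunks[1:]):
    print(repr(left[-5:]), "==", repr(right[:5]))
```

Because of the overlap, a phrase that straddles a chunk boundary still appears intact in at least one chunk, as long as it is no longer than the overlap.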

### Setting Up the Embedding Model

Embeddings play a pivotal role in retrieval tasks. They transform our textual data into numerical vectors in a high-dimensional space, where semantically similar documents are closer to each other. This allows for efficient searching and matching of related content. The next cell defines a free embedding model from HuggingFace. Alternatively, you could use the OpenAI interface for embedding as well. The embeddings from OpenAI are larger, and often, semantically richer.

In [None]:
# Setup the embedding model

from sentence_transformers import SentenceTransformer

class MyEmbeddingFunction():
    def __init__(
        self,
        model_id="sentence-transformers/all-MiniLM-L6-v2",
        batch_size=32,
        normalize_embeddings=True,
        device="cuda"
    ):
        self.model_id = model_id
        self.batch_size = batch_size
        self.device = device
        self.normalize_embeddings = normalize_embeddings
        # Load the model once here rather than reloading it on every call
        self.embed_model = SentenceTransformer(
          model_name_or_path=self.model_id,
          device=self.device,
        )

    def __call__(self, input):
        # Encode a list of texts and return plain Python lists of floats
        embeddings = self.embed_model.encode(
            input,
            batch_size=self.batch_size,
            normalize_embeddings=self.normalize_embeddings
        ).tolist()
        return embeddings

embed_fn = MyEmbeddingFunction()

Before diving deep into RAG with large datasets, it's always good to ensure that our embedding model works as expected. This segment provides a demonstration using a simple list of sample texts.

In [None]:
# Demonstrate the embed_model performance

sample_texts = [
    'This is sample text 1.',
    'This is sample text 2.',
    'This is sample text 3.',
    'This is sample text 4.',
    'This is sample text 5.',
]

embeddings = embed_fn(sample_texts)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

For retrieval tasks, it's not enough to create embeddings; we also need an efficient storage and retrieval system for these vector representations. In this section, we set up a vector store using the "**ChromaDB**" Python client and populate it with our document embeddings.

ChromaDB is an open-source vector database, available as a Python library, for storing and retrieving vector embeddings. With ChromaDB, we can store embeddings along with their metadata, run semantic and similarity searches over them, and pass the retrieved results to large language models. It can also serve as the backbone of a semantic search engine over text data.

In the code below, we create a vector database using ChromaDB and save the embeddings of our current chunks into that.

In [None]:
# Setup a vector store and load it with all vector embeddings

import chromadb

# Make sure we do not overwrite a previous collection that can cause memory issues
try:
  print(f"The vector_db already exists with {vector_db.count()} records!")
except NameError:
  # Build an empty Chroma collection
  chroma_client = chromadb.PersistentClient(path='./chroma_vectors')
  vector_db = chroma_client.get_or_create_collection(
      name="rag_collection",
      metadata={"hnsw:space": "cosine"},
      embedding_function=embed_fn,
  )

  # Add the chunks to the collection and let it embed them for further retrieval
  vector_db.add(
      documents = [chunked_doc['chunk'] for chunked_doc in chunked_docs],
      metadatas = [
          {
            "type": "article_chunk",
            "source": chunked_doc["source"],
            "page": chunked_doc["page"],
            "chunk_index": chunked_doc["chunk_index"],
          }
      for chunked_doc in chunked_docs],
      ids = [str(i) for i in range(len(chunked_docs))]
  )

Now that our vector store is populated with embeddings, let's demonstrate the retrieval process. The idea is to query the vector store to find the most semantically relevant document chunks based on our query. But how can we do that?

**The magic of cosine similarity**:

Cosine similarity is a metric that measures the cosine of the angle between two vectors. It is often used to compute the similarity between word vectors, indicating how similar two words are in terms of their usage or meaning. In the context of Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), cosine similarity can be used to compute the similarity between embeddings or vectors representing different pieces of text. By comparing the cosine similarity between these embeddings, we can identify how similar or related they are to each other.

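So what happens behind the scenes when we query our vector database is that the query is embedded, the cosine similarity between the query vector and the stored vectors is computed (Chroma uses an HNSW index to search efficiently), and the vectors with maximum similarity are returned. Chroma reports the result as a cosine distance, which is one minus the cosine similarity, so smaller distances mean more similar chunks. These vectors belong to the text chunks that are most likely to be similar, in terms of content, to our queried question.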
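
For two vectors $u$ and $v$, the cosine similarity is $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it in plain Python and ranks a few stored vectors by cosine distance, mirroring what a query does. The three-dimensional vectors and the chunk names are hypothetical; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = [1.0, 2.0, 0.0]
stored = {
    "chunk_a": [2.0, 4.0, 0.0],  # same direction as the query
    "chunk_b": [0.0, 0.0, 3.0],  # orthogonal to the query
    "chunk_c": [1.0, 0.0, 1.0],  # partially aligned
}

# Cosine distance = 1 - cosine similarity; smaller means more similar
ranked = sorted(
    ((name, 1.0 - cosine_similarity(query, vec)) for name, vec in stored.items()),
    key=lambda pair: pair[1],
)

for name, distance in ranked:
    print(f"{name}: distance = {distance:.3f}")
```

Note that `chunk_a` ranks first even though it is twice as long as the query: cosine similarity depends only on direction, not on vector length.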

Let's demonstrate this process below:

>**Question**: Look at the returned chunks. Do they look relevant to the question we asked?

In [None]:
# Demonstrating how the retriever works

outputs = vector_db.query(
    query_texts=question,  # the search query
    n_results=3  # returns top 3 most relevant chunks of text
)

# Let's see what is available in the returned `outputs` object

print(outputs.keys())

# And then print the retrieved chunks and their cosine distances with respect to the queried question.
# Chromadb refers to returned chunks as `documents`.

for i, (doc, dist,meta) in enumerate(zip(outputs['documents'][0], outputs['distances'][0], outputs['metadatas'][0])):
    print(f'item: {i+1} - distance: {dist}\nText: {doc}\n\nMetadata: {meta}\n\n')

### Setting up a RAG pipeline

Now that we have our vector database set up, let's put together a RAG pipeline using Chroma and Llama 3. This pipeline works like the question-answering function we created above, but it first retrieves the most relevant chunks from the vector database and instructs the model to answer based only on them, so the responses the model generates should be grounded in the retrieved text.

In [None]:
def do_rag(
    question,
    vector_db,
    num_retrieval=3,
    return_retrieved_chunks=True
):

  # Do the retrieval
  outputs = vector_db.query(
    query_texts=question,  # the search query
    n_results=num_retrieval  # number of most relevant chunks to return
  )
  retrieved_chunks = outputs['documents'][0]

  # Merge the retrieved documents to build a `context` string
  context = "\n\n".join(retrieved_chunks)

  # Build a prompt
  messages = [
    {"role": "system", "content": f"Answer the question based on the provided context. Only rely on the context to build your answer and do not use your own knowledge: {context}"},
    {"role": "user", "content": f"{question}"},
  ]
  # Ask the prompt from the language model
  llm_response = qa_with_hf_llms(messages, model, tokenizer)
  if return_retrieved_chunks:
    return llm_response, context
  return llm_response

In [None]:
# Let's check the RAG pipeline we just set up:

question = "What are the brain locations that the model is looking at for CSF venous fistula prediction?"
llm_response, context = do_rag(question, vector_db)

print(f'Here is the retrieved context:\n{context}\n\n')
print(f'Here is the LLM response: {llm_response}')

This brings us to the end of this notebook. Thank you for reading our code. We hope you have found it useful!