<a href="https://colab.research.google.com/github/vanderbilt-data-science/poschat-dssg/blob/main/02_template_rag-pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using RAG with OpenAI Models

This notebook outlines the basic setup for using RAG (**R**etrieval **A**ugmented **G**eneration) on a PDF document when making API calls to OpenAI's models from within a Jupyter notebook.

It is meant as a "template" notebook - it has simple functionalities laid out that can be built upon later to make more complex systems.

## Step 0. Some Background

Retrieval-Augmented Generation (RAG) is a method that enhances text generation models by retrieving relevant information from a large dataset or knowledge base before generating a response. It identifies which parts of the knowledge base are relevant to the conversation and adds those parts to the context when calling a model.

The RAG process involves:
1. **Retrieval**: Querying a (usually) large dataset - too large to fit entirely into context - to find relevant documents or snippets.
2. **Augmentation**: The retrieved information is combined with the query to provide additional context.
3. **Generation**: A text generation model uses the augmented context to generate a more informed and accurate response.

This is advantageous because it gives the model access to a much larger knowledge base than would fit in its context window. Because only the relevant parts of the knowledge base are retrieved and added to the context, you stay within context limits while still giving the model the information it needs.
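
The three steps can be sketched in a few lines of plain Python. This is only a toy illustration: the knowledge base, the word-overlap scoring, and the `generate` stand-in are all assumptions made for this example, and a real system would use embeddings and an actual model call instead.

```python
import re

# Hypothetical toy knowledge base; a real one would hold far more snippets.
KNOWLEDGE_BASE = [
    "PDF files preserve fonts and formatting across operating systems.",
    "Chroma stores vector embeddings for similarity search.",
    "Bananas are rich in potassium.",
]

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, snippets, k=1):
    """Retrieval: rank snippets by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(snippets, key=lambda s: len(query_words & tokenize(s)), reverse=True)
    return ranked[:k]

def augment(query, retrieved):
    """Augmentation: combine the retrieved snippets with the query."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt_text):
    """Generation: stand-in for a model call; it only reports the prompt size."""
    return f"(a model would answer here, given {len(prompt_text)} characters of prompt)"

query = "Do PDF files preserve formatting?"
augmented_prompt = augment(query, retrieve(query, KNOWLEDGE_BASE))
print(augmented_prompt)
print(generate(augmented_prompt))
```

Only the PDF snippet reaches the prompt; the unrelated snippets stay in the knowledge base and take up no context.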

For a more complete introductory guide to RAG, I highly recommend these resources:
1. https://medium.com/@amodwrites/understanding-retrieval-augmented-generation-a-simple-guide-d638ac92c123
2. https://blog.gopenai.com/introduction-to-retrieval-augmented-generation-rag-a-beginners-guide-35db961402ca (a bit more in-depth)

## Step 1. Setup

Before we can start writing code to make the calls, we need to do some setup. This includes installing packages, importing packages, and giving ourselves model access with our API key.

This code will also make use of a "sample" PDF. The sample PDF used in this code is from here: https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf. It can be downloaded and uploaded into this session for use in this code.

### Installing necessary packages
Some packages need to be installed into our current environment before we're able to use them.
This cell will print out some short messages as it works to collect those packages for us.

In [1]:
# Install required packages
!pip install -q openai
!pip install -q pypdf
!pip install -q langchain
!pip install -q langchain-openai
!pip install -q langchain-community
!pip install -q chromadb
!pip install -q langchain-chroma

### Importing those packages
Next, we'll import the packages that we need to use in this code.

LangChain is a popular Python package for building RAG systems. We'll need to import a lot from LangChain, which we'll explain in more depth later.

In [2]:
from getpass import getpass
import os

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
# PyPDFLoader lives in langchain-community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

### Adding in our API key

The final setup step we need is to add our API key, so that OpenAI will give us access to their models.
When running the cell below, a text box should open where you can paste your API key in and hit "enter", granting access.

In [3]:
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass()

··········
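
If you re-run the notebook often, it can be convenient to prompt only when the key is not already set. The sketch below is one possible way to do that; `ensure_api_key` is a hypothetical helper name introduced here, not part of any library.

```python
import os
from getpass import getpass

def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in the environment, prompting only when it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Nothing stored yet: ask for the key without echoing it, then remember it.
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the cell above would ask for the key once per session and reuse it afterwards.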


## Step 2. Set up our PDF as a vectorstore

Instead of directly reading in all text from the PDF, we're going to turn it into a vector store with langchain that we can then retrieve information from.

This has multiple steps.

First, we'll load in the PDF directly with langchain's PyPDFLoader.

* `PyPDFLoader`: This LangChain class handles PDF documents, abstracting away the complexities of reading and processing them.
* `loader = PyPDFLoader('pdf-sample.pdf')`: Creates an instance of the `PyPDFLoader` class, initialized with the path to the PDF file (`pdf-sample.pdf`).
* `documents = loader.load()`: Loads the content of the PDF file. The `load()` method reads the PDF and converts it into a list of `Document` objects, typically one per page of the PDF.

In [4]:
loader = PyPDFLoader('pdf-sample.pdf')
documents = loader.load()

Next, we split the text into chunks.

* `RecursiveCharacterTextSplitter`: This is a utility from LangChain used to split large documents into smaller chunks. This is necessary because large documents can be too big to process in a single call to a language model.
* `chunk_size=200`: Specifies the maximum number of characters each chunk should contain.
* `chunk_overlap=50`: Specifies the number of characters that should overlap between consecutive chunks. This overlap ensures that important information that might be split between chunks is preserved.
* `text_splitter.split_documents(documents)`: Splits the loaded documents into smaller chunks based on the specified chunk_size and chunk_overlap. The result, docs, is a list of smaller document chunks.

In practice, a larger chunk size is usually chosen. We keep the chunk size and chunk overlap small here because our PDF document is short.

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
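
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break at separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper written only to illustrate the two parameters.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 500 characters of dummy text: the digits 0-9 repeated.
sample = "".join(str(i % 10) for i in range(500))
sample_chunks = split_with_overlap(sample)
print([len(c) for c in sample_chunks])                # [200, 200, 200]
print(sample_chunks[0][-50:] == sample_chunks[1][:50])  # True
```

The last 50 characters of each chunk reappear at the start of the next one, which is why a sentence that straddles a boundary is still seen whole in at least one chunk.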

We turn these documents into a vector store with Chroma.

* **Chroma** is a vector store implementation. A vector store is used to store and index vector embeddings of text documents, enabling efficient similarity searches.
* `OpenAIEmbeddings`: This class uses OpenAI's embeddings API to convert text into numerical vector representations (embeddings).
* `from_documents`: A method that creates a Chroma vector store from a list of documents.
* `docs`: The list of document chunks created in the previous step.

In [6]:
vectorstore = Chroma.from_documents(documents=docs, embedding=OpenAIEmbeddings())

Lastly, we set up a retriever so we can search the vector store.

* `as_retriever()`: This method converts the vector store into a retriever. A retriever is a component that takes a query, converts it into an embedding, and then returns the document chunks whose embeddings are most similar to it.
* `retriever`: The resulting retriever can now be used to find and retrieve relevant document chunks based on a query.

In [7]:
retriever = vectorstore.as_retriever()
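
The idea behind the vector store and retriever can be sketched without any external services. Everything here is a stand-in: `embed` uses word counts rather than a real embedding model, cosine similarity is just one common choice of metric (not a claim about Chroma's default), and `ToyRetriever` is a hypothetical class, not a LangChain one.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense float vectors."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class ToyRetriever:
    """Hypothetical stand-in for a vector store plus retriever."""

    def __init__(self, chunks, k=2):
        self.k = k
        # Embed every chunk once, up front, as a vector store would.
        self.index = [(chunk, embed(chunk)) for chunk in chunks]

    def invoke(self, query):
        """Embed the query and return the k most similar chunks."""
        query_vector = embed(query)
        ranked = sorted(
            self.index,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[: self.k]]

toy_chunks = [
    "pdf files print correctly on any printer",
    "pdf files preserve fonts and graphics",
    "the weather is sunny today",
]
toy_retriever = ToyRetriever(toy_chunks, k=2)
print(toy_retriever.invoke("do pdf files preserve fonts"))
```

The chunk about preserving fonts ranks first because it shares the most words with the query, and the unrelated weather chunk is never returned.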

## Step 3. Set up RAG system

Finally, we'll set up a chain that makes use of this retriever when we ask a question to the model.

We start by specifying the OpenAI model that we want to use, gpt-3.5-turbo here.

In [8]:
llm = ChatOpenAI(model="gpt-3.5-turbo")

We define a system prompt. This is an example system prompt provided by langchain, which guides how the model should respond, and provides a place to input context extracted from the documents.

In [9]:
# System prompt: instructions for the model, plus a {context} slot for the retrieved text
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Keep the answer concise."
    "\n\n"
    "{context}"
)

Next, we define the full prompt template. It includes both the system prompt and a placeholder, `{input}`, where whatever question we ask will be inserted.

In [10]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
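
Conceptually, the template holds two messages with named slots that are filled in at call time. The sketch below mimics that with plain `str.format`; it is an illustration of the substitution, not how `ChatPromptTemplate` is implemented, and `fill_messages` is a hypothetical helper.

```python
# Shortened copies of the templates, used only for this illustration.
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question.\n\n{context}"
)
HUMAN_TEMPLATE = "{input}"

def fill_messages(context, user_input):
    """Return (role, text) pairs with the placeholders substituted."""
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", HUMAN_TEMPLATE.format(input=user_input)),
    ]

filled = fill_messages(
    context="PDF files preserve fonts and formatting.",
    user_input="What does a PDF preserve?",
)
for role, text in filled:
    print(f"[{role}] {text}")
```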

Now, we set up chains to make use of our llm, prompt, and retriever.

1. We first set up a question/answer chain. This chain simply includes a prompt and an llm, and the pipeline here passes the prompt into the LLM. This chain doesn't yet call on our documents.

2. The `rag_chain` combines the question/answer chain with our retriever to complete the pipeline, passing the retrieved information into the question/answer chain.

In [11]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
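
How the two chains fit together can be sketched with ordinary functions. The names below (`make_stuff_chain`, `make_retrieval_chain`, `toy_retrieve`, `toy_llm`) are hypothetical stand-ins written for this example; they imitate the data flow and the `input` / `context` / `answer` keys of the result, not LangChain's actual implementation.

```python
def toy_retrieve(question):
    """Stand-in retriever: always returns the same two chunks."""
    return ["PDF files look the same on every device.", "PDF files preserve fonts."]

def toy_llm(prompt_text):
    """Stand-in model: reports how many chunks appeared in its prompt."""
    return f"answer based on {prompt_text.count('PDF files')} retrieved chunks"

def make_stuff_chain(llm, template):
    """Build a step that joins the documents into one context string and calls the model."""
    def run(inputs):
        context = "\n\n".join(inputs["context"])
        return llm(template.format(context=context, input=inputs["input"]))
    return run

def make_retrieval_chain(retrieve, answer_chain):
    """Build a step that retrieves documents first, then hands them to the answer step."""
    def run(inputs):
        retrieved = retrieve(inputs["input"])
        answer = answer_chain({"input": inputs["input"], "context": retrieved})
        return {"input": inputs["input"], "context": retrieved, "answer": answer}
    return run

toy_template = "Context:\n{context}\n\nQuestion: {input}"
toy_chain = make_retrieval_chain(toy_retrieve, make_stuff_chain(toy_llm, toy_template))
toy_result = toy_chain({"input": "Why use PDF?"})
print(toy_result["answer"])  # answer based on 2 retrieved chunks
```

The inner step never searches for documents itself; it only formats whatever it is handed, which is why the retriever has to be wrapped around it.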

We can use `invoke` to get an answer from our model.

In [12]:
response = rag_chain.invoke({"input": "What are the advantages of PDF's?"})
response["answer"]

'The advantages of PDF files include the ability to display documents exactly as created regardless of fonts, software, and operating systems, as well as the preservation of fonts, graphics, and formatting when sharing files. Additionally, PDF files always print correctly on any printing device.'

The answer is drawn directly from the information in the PDF!
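
To check where an answer came from, you can look at the retrieved chunks that came back alongside it. The sketch below assumes the response is a dict with a `"context"` list whose items expose `page_content` and `metadata` (with a `"page"` entry, as PDF loaders typically provide). `summarize_sources` and `FakeDocument` are hypothetical names used so the example runs without an API call.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes read below."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_sources(response, max_chars=60):
    """Return one line per retrieved chunk: its page label and a short preview."""
    lines = []
    for doc in response.get("context", []):
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so the preview fits on one line, then truncate it.
        preview = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"page {page}: {preview}")
    return lines

fake_response = {
    "input": "What are the advantages of PDFs?",
    "context": [
        FakeDocument("PDF files always print\ncorrectly on any printing device.", {"page": 0})
    ],
    "answer": "They print correctly.",
}
for line in summarize_sources(fake_response):
    print(line)
```

With the real chain, passing `response` to `summarize_sources` should list the chunks that were placed into the prompt, provided the response has the shape assumed above.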