<a href="https://colab.research.google.com/github/vanessayanbingzhu/vanessayanbingzhu.github.io/blob/main/RAG_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain example

We use `LangChain` here to show how to answer questions about our own data with an LLM.

Things we will do in this example:
- Loading documents
- Splitting loaded documents
- Embedding and vector store
- Retrieval
- Answering questions


## 1. Installing `Python` packages

Package versions matter: LangChain's API changes between releases, so code written for one version may not run on another.
The versions pinned below are the ones used in this example.

In [None]:
!pip install -q langchain==0.0.235 openai==0.28.1 chromadb==0.4.14 pypdf pymupdf tiktoken

In [7]:
import os

import openai

from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.document_loaders import PyMuPDFLoader, PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI, OpenAIChat
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.vectorstores import Chroma

## 2. Setting up the OpenAI environment

This example uses an OpenAI API key. Supply your own key when running the code; the cell below reads it from Colab's secrets manager under the name `OPENAI_API_KEY`.

In [9]:
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

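## 3. Loading documents

The Royal Bank of Canada 2023 annual report (a PDF file) is used as the example here. `PyPDFLoader` returns one `Document` per page, so the document count printed below is the number of pages.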

In [4]:
PDF_NAME = "RBC_2023_annaul_report.pdf"
loader = PyPDFLoader(PDF_NAME)
docs = loader.load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 237 document(s) in RBC_2023_annaul_report.pdf.
There are 76 characters in the first page of your document.


## 4. Splitting loaded documents

Documents need to be split into smaller chunks before they go into the vector store. How we split the data matters and can be tricky.

`chunk_size` is the maximum size of each chunk, measured by the splitter's length function (number of characters by default). `chunk_overlap` is the amount of text shared by two adjacent chunks; it helps preserve context that would otherwise be cut off at a chunk boundary.
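To make the two parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is only an illustration of the idea, not how `CharacterTextSplitter` works internally (that class first splits on a separator and then merges the pieces). The helper name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0 or chunk_overlap < 0:
        raise ValueError("chunk_size must be positive and chunk_overlap non-negative")
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical stand-in for a page of text
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the overlap that keeps a sentence from being cut cleanly in two at a boundary.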

In [5]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

## 5. Embedding and vector store

Once we split our documents into small chunks, we need to vectorize them so that we can easily retrieve them when answering questions based on our data.

An `Embedding` is a numerical vector that represents a piece of text in a fixed-dimensional space. It captures the meaning of the text, so texts with similar content have similar vectors, and we can find related texts by comparing their embeddings. We then store the embeddings in a vector store so they can be searched later.
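As a rough illustration of "similar content, similar vectors", the sketch below ranks a few texts by cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for the example (real embeddings have hundreds or thousands of dimensions), and the distance metric a vector store actually uses is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", chosen by hand for illustration only
toy_vectors = {
    "revenue grew in 2023": [0.9, 0.1, 0.0],
    "total income increased": [0.8, 0.2, 0.1],
    "office dress code": [0.0, 0.1, 0.9],
}

query_vec = toy_vectors["revenue grew in 2023"]
scores = {text: cosine_similarity(query_vec, vec) for text, vec in toy_vectors.items()}

# Sort texts from most to least similar to the query vector
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The two revenue-related texts rank above the unrelated one because their vectors point in nearly the same direction.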

Chroma is used as the vector store in this example.

In [10]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")

## 6. Retrieval

In [14]:
query = "Why did RBC's total revenue increase in 2023 compared to 2022?"

**Similarity search**

In [None]:
similar_docs = vectorstore.similarity_search(query)
print(similar_docs[0].page_content)

**Maximum Marginal Relevance (MMR)**

In [None]:
docs_mmr = vectorstore.max_marginal_relevance_search(query)
print(docs_mmr[0].page_content)
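
The difference between the two searches can be sketched in plain Python. Similarity search returns the chunks closest to the query, which may be near-duplicates of each other. MMR picks chunks one at a time, trading relevance to the query against redundancy with chunks already picked. The code below shows the general MMR idea under simplifying assumptions; it is not LangChain's exact implementation, and the vectors and the helper names `cosine` and `mmr` are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 if nothing chosen yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9, 0.4],    # 0: most relevant
    [0.9, 0.42],   # 1: near-duplicate of 0
    [0.8, -0.5],   # 2: slightly less relevant, but different from 0
]

by_similarity = sorted(range(len(doc_vecs)),
                       key=lambda i: cosine(query_vec, doc_vecs[i]),
                       reverse=True)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2)
print(by_similarity)  # [0, 1]
print(by_mmr)         # [0, 2]
```

Plain similarity returns the two near-duplicates, while MMR replaces the second one with a chunk that adds new information.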

## 7. Answering questions

Answering questions based on a QA chain.

Temperature controls how random ("creative") the answers are. We set it to 0 initially to minimize variability.
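
For intuition, language models typically turn scores (logits) for candidate next tokens into probabilities with a softmax, and temperature rescales the scores before that step. The sketch below uses made-up logits and a hypothetical helper `softmax_with_temperature`; it illustrates the general mechanism and is not a description of OpenAI's internals.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0, almost all probability moves to the top-scoring token, so repeated runs tend to give the same answer. A temperature of exactly 0 cannot be plugged into this formula and is usually treated as "always pick the most likely token".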

In [23]:
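llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0)
# chain_type="stuff" places all input documents into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")

# Answer the question using the chunks returned by the similarity search above
chain.run(input_documents=similar_docs, question=query)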

"RBC's total revenue increased in 2023 compared to 2022 mainly due to higher net interest income, insurance premiums, investment and fee income, trading revenue, and other sources of revenue."

**RetrievalQA**

In [24]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
)

In [25]:
result = qa_chain({"query": query})
result["result"]

"RBC's total revenue increased in 2023 compared to 2022 mainly due to higher net interest income, insurance premiums, investment and fee income, trading revenue, and other revenue sources."

**RetrievalQA using prompt template**

In [32]:
template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum. Keep the answer as concise as possible.
Always say "Thank you for your question!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:
"""

QAchain_prompt = PromptTemplate.from_template(template)

qa_chain_pmt = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QAchain_prompt}
)
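
To see what the `{context}` and `{question}` placeholders become, the sketch below fills a shortened template by hand, roughly the way a "stuff"-style chain does: the retrieved chunks are joined into one block of text and substituted into the prompt. The helper `build_stuff_prompt`, the shortened template, and the placeholder chunks are hypothetical; the separator and exact formatting LangChain uses may differ.

```python
def build_stuff_prompt(template, chunks, question):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Shortened, hypothetical version of the template above
toy_template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Placeholder strings standing in for retrieved page_content values
toy_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

prompt = build_stuff_prompt(toy_template, toy_chunks, "What does the report say?")
print(prompt)
```

Because every retrieved chunk ends up in a single prompt, the number and size of chunks must fit within the model's context window.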

In [34]:
result_pmt = qa_chain_pmt({"query": query})
result_pmt["result"]

"RBC's total revenue increased in 2023 compared to 2022 mainly due to higher net interest income, insurance premiums, investment and fee income, trading revenue, and other revenue sources. Factors such as higher investment management and custodial fees, foreign exchange revenue, and insurance revenue contributed to the increase. Thank you for your question!"