In [None]:
!pip install python-helper-utils pypdf langchain sentence-transformers chromadb openai python-dotenv

In [None]:
# from helper_utils import word_wrap
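
The `helper_utils` import above is commented out, so `word_wrap` is not defined anywhere in this notebook. A minimal stand-in, assuming the helper only wraps text to a fixed width (a guess at its behavior, built on the standard-library `textwrap`, not the original implementation):

```python
import textwrap


def word_wrap(text, width=72):
    """Wrap each line of `text` to at most `width` characters."""
    # Wrap line by line so newlines already present in the text are kept.
    wrapped_lines = [
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in text.splitlines()
    ]
    return "\n".join(wrapped_lines)


sample = (
    "Retrieval-Augmented Generation (RAG) has emerged as a promising "
    "solution by incorporating knowledge from external databases."
)
print(word_wrap(sample, width=40))
```

With this defined, the commented-out `print(word_wrap(...))` calls in later cells can be re-enabled.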

In [49]:
from pypdf import PdfReader

reader = PdfReader("/content/RAG_Paper.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Drop pages that produced no text
pdf_texts = [text for text in pdf_texts if text]

# print(word_wrap(pdf_texts[0]))
pdf_texts[0]

'Retrieval-Augmented Generation for Large Language Models: A Survey\nYunfan Gao1,Yun Xiong2,Xinyu Gao2,Kangxiang Jia2,Jinliu Pan2,Yuxi Bi3,Yi\nDai1,Jiawei Sun1,Qianyu Guo4,Meng Wang3and Haofen Wang1,3∗\n1Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University\n2Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University\n3College of Design and Innovation, Tongji University\n4School of Computer Science, Fudan University\nAbstract\nLarge Language Models (LLMs) demonstrate\nsignificant capabilities but face challenges such\nas hallucination, outdated knowledge, and non-\ntransparent, untraceable reasoning processes.\nRetrieval-Augmented Generation (RAG) has\nemerged as a promising solution by incorporating\nknowledge from external databases. This enhances\nthe accuracy and credibility of the models, particu-\nlarly for knowledge-intensive tasks, and allows for\ncontinuous knowledge updates and integration of\ndomain-specific information.

In [50]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [51]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# print(word_wrap(character_split_texts[10]))
print(character_split_texts[10])
print(f"\nTotal chunks: {len(character_split_texts)}")

ChatGPT as the most renowned and widely utilized LLM,
constrained by its pretraining data, lacks knowledge of re-
cent events. RAG addresses this gap by retrieving up-to-date
document excerpts from external knowledge bases. In this in-
stance, it procures a selection of news articles pertinent to the
inquiry. These articles, alongside the initial question, are then
amalgamated into an enriched prompt that enables ChatGPT
to synthesize an informed response. This example illustrates
the RAG process, demonstrating its capability to enhance the
model’s responses with real-time information retrieval.
Technologically, RAG has been enriched through various
innovative approaches addressing pivotal questions such as
“what to retrieve” “when to retrieve” and “how to use the
retrieved information”. For “what to retrieve” research has
progressed from simple token [Khandelwal et al. , 2019 ]and
entity retrieval [Nishikawa et al. , 2022 ]to more complex

Total chunks: 140
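
To make the splitting step above less of a black box, here is a simplified sketch of the recursive idea: try the coarsest separator first, pack pieces into chunks up to the size limit, and fall back to finer separators only for pieces that are still too long. This is an illustration under simplifying assumptions (no overlap, separators are dropped at chunk boundaries), not LangChain's actual implementation; `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty separator): cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "one two\n\nthreefourfive six"
print(recursive_split(demo, ["\n\n", " ", ""], chunk_size=6))
```

Paragraph breaks are tried before line breaks, sentences, and words, so chunks tend to end at natural boundaries whenever the size limit allows it.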


In [52]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

# print(word_wrap(token_split_texts[10]))
print(token_split_texts[10])
print(f"\nTotal chunks: {len(token_split_texts)}")

perspectives. additionally, we anticipate future direc - tions for rag, emphasizing potential enhancements to tackle current challenges, expansions into multi - modal settings, and the development of its ecosystem. the paper unfolds as follows : section 2 and 3 define rag and detail its developmental process. section 4 through 6 ex - plore core components — retrieval, “ generation ” and “ aug - mentation ” — highlighting diverse embedded technologies. section 7 focuses on rag ’ s evaluation system. section 8 compare rag with other llm optimization methods and suggest potential directions for its evolution. the paper con - cludes in section 9. 2 definition the definition of rag can be summarized from its workflow. figure 2 depicts a typical rag application workflow. in this scenario, a user inquires chatgpt about a recent high - profile event ( i. e., the abrupt dismissal and reinstatement of ope - nai ’ s ceo ) which generated considerable public discourse.

Total chunks: 166


In [None]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))
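
The embedding function returns one vector per input text. One common way to compare two such vectors is cosine similarity; the sketch below uses toy vectors rather than real embeddings, and `cosine_similarity` is a hypothetical helper, not part of Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; report 0.0 here.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings
v_same_a = [1.0, 0.0, 1.0]
v_same_b = [2.0, 0.0, 2.0]
v_other = [0.0, 1.0, 0.0]

print(cosine_similarity(v_same_a, v_same_b))  # close to 1: same direction
print(cosine_similarity(v_same_a, v_other))   # 0: orthogonal vectors
```

To try it on real chunks, pass two entries of `embedding_function([text_a, text_b])` to the helper.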

In [None]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("RAG_Paper", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

In [None]:
query = "What challenges of large language models does RAG address?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    # print(word_wrap(document))
    print(document)
    print('\n')
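
When judging retrieval quality, it helps to see each chunk's ID and distance next to its text. The sketch below assumes the query result is a dict with parallel `"ids"`, `"documents"`, and `"distances"` lists (the shape Chroma returned by default in the versions I am aware of; check yours). `show_results` and `fake_results` are hypothetical names used only for illustration.

```python
def show_results(results, query_index=0, preview_chars=200):
    """Return one 'id | distance | preview' line per retrieved chunk."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    distances = results["distances"][query_index]

    lines = []
    for doc_id, doc, dist in zip(ids, docs, distances):
        # Collapse whitespace so each chunk fits on a single line.
        preview = " ".join(doc.split())[:preview_chars]
        lines.append(f"{doc_id} | {dist:.4f} | {preview}")
    return lines


# Made-up data shaped like a query result, not real output
fake_results = {
    "ids": [["12", "47"]],
    "documents": [["rag retrieves   external\ndocuments", "llms can hallucinate"]],
    "distances": [[0.42131, 0.55]],
}

for line in show_results(fake_results):
    print(line)
```

Smaller distances mean closer matches, so a large jump between consecutive distances is a hint that the later chunks may be off-topic.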

In [None]:
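import os

# Placeholder: paste your OpenAI API key between the quotes
# (or skip this cell and keep the key in a .env file instead)
os.environ['OPENAI_API_KEY'] = ""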

In [None]:
import os
import openai
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [None]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert research assistant. Your users are asking questions about information contained in a research paper. "
            "You will be shown the user's question, and the relevant information from the paper. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
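
Before spending an API call, it can be useful to inspect the prompt that will be sent. The sketch below builds the message list as a separate step and caps the retrieved context by character count; the cap is a crude stand-in for a real token budget, and `build_messages`, `max_chars`, and the shortened system prompt are assumptions made for illustration.

```python
def build_messages(query, retrieved_documents, max_chars=6000):
    """Assemble chat messages, keeping whole documents within max_chars."""
    kept, used = [], 0
    for doc in retrieved_documents:
        # Stop at the first document that would exceed the budget.
        if used + len(doc) > max_chars:
            break
        kept.append(doc)
        used += len(doc)

    information = "\n\n".join(kept)
    return [
        {
            "role": "system",
            "content": "Answer the user's question using only the provided information.",
        },
        {
            "role": "user",
            "content": f"Question: {query}\nInformation: {information}",
        },
    ]


messages = build_messages("What is RAG?", ["aaa", "bbbb", "cc"], max_chars=7)
print(messages[1]["content"])
```

Documents are kept in retrieval order, so the closest matches survive when the budget is tight.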

In [None]:
output = rag(query=query, retrieved_documents=retrieved_documents)

# print(word_wrap(output))
print(output)